Components of a Data Analysis System - PowerPoint PPT Presentation

Slides: 22
Provided by: ronaldjm
Learn more at: https://www.gb.nrao.edu

1
Components of a Data Analysis System
  • Scientific Drivers in the Design of an Analysis
    System

2
Data Import
  • Format
  • Either widely used/accepted, or
  • Can be converted easily from something widely
    used
  • User need not know the details of the format
  • Well documented (e.g., which flavor of latitude).
  • Fast Access
  • Disk I/O speeds do not follow Moore's law
  • Read speed is more important than write speed
  • Caching
  • File size is only important to keep access times
    low
  • Content must represent the details of the data
  • E2E - Full intent of the observer must be embedded

3
Data Export
  • Format
  • Either widely used/accepted, or
  • Can be converted easily into something widely
    used
  • User need not know the details of the format
  • Well documented (e.g., which flavor of latitude).
  • You can read what you write
  • Import format = Export format
  • Fast Access
  • Disk I/O speeds do not follow Moore's law
  • Read speed is more important than write speed
  • Content must represent the details of the data
  • E2E - Full intent of the observer must be
    embedded.
  • Includes user annotation/comments

4
Data Base System
  • Ability to work with more than one data set
  • Data base for both export and import files
  • Large data volumes
  • Access using scan numbers is no longer sufficient
  • Require the ability to select subsets of data via
    sophisticated data-base queries
  • Moderate number of columns in data base index
  • Index to data kept in memory to speed data
    access
  • File summaries at various levels of detail
  • Various levels of granularity
  • Calibrated and raw data
  • E2E - User can add annotation/comments
  • Security: only the observer can access data
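
The in-memory index and query-based selection described on this slide could look roughly like the sketch below. It is only an illustration: the `IndexRow` columns, the `select` helper, and the sample rows are all hypothetical and not taken from any NRAO system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class IndexRow:
    """One row of a (hypothetical) in-memory index into a data file."""
    scan: int
    source: str
    obs_type: str      # e.g. "spectral-line" or "continuum"
    freq_mhz: float
    calibrated: bool
    offset: int        # byte offset of the record in the data file


def select(index: Iterable[IndexRow],
           *predicates: Callable[[IndexRow], bool]) -> List[IndexRow]:
    """Return the rows that satisfy every predicate (all rows if none given)."""
    return [row for row in index if all(p(row) for p in predicates)]


# Made-up sample index; a real one would be built while filling the file.
INDEX = [
    IndexRow(1, "W3OH", "spectral-line", 1420.4, False, 0),
    IndexRow(2, "W3OH", "spectral-line", 1420.4, True, 4096),
    IndexRow(3, "3C286", "continuum", 8400.0, False, 8192),
    IndexRow(4, "W3OH", "spectral-line", 1665.4, True, 12288),
]

# Select by properties of the data rather than by scan number alone.
hits = select(
    INDEX,
    lambda r: r.source == "W3OH",
    lambda r: r.calibrated,
    lambda r: r.freq_mhz < 1500.0,
)
print([r.scan for r in hits])  # [2]
```

Because only the small index is held in memory, a query touches the data file only for the records it actually selects (via `offset`).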

5
Data Archive
  • Write speed more important than read speed.
  • File size is very important
  • Cannot anticipate types of user queries
  • Large number of columns in data base index
  • Very sophisticated/fast RDBMS
  • Storage need not be a widely used data format
  • Format can be very different from that used by
    analysis system.
  • Export format should be a widely used data format

6
Interactive On-Line Data Analysis
  • The ability to access data ASAP
  • Import file updates automatically as observations
    proceed (real-time filler).
  • Index to file updates automatically
  • Updates happen per integration (spectral-line)
    or per N seconds (continuum)
  • Minimum integration time is a few times the
    minimum time of the real-time filler
  • Analysis system automatically is aware of updated
    index.
  • Read-protect online/filled data?
  • User should be able to see the data within an
    integration of when it was taken (or N seconds).
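
One way an analysis system could stay aware of an index that grows as observations proceed is to remember how far it has read and pick up only complete new rows on each poll. The sketch below assumes a line-oriented text index; the `IndexFollower` class, the file name, and the row contents are hypothetical.

```python
import os
import tempfile
from typing import List


class IndexFollower:
    """Track a growing, line-oriented index file and return only new rows."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._offset = 0  # bytes of the file already consumed

    def poll(self) -> List[str]:
        """Return rows appended since the last poll (complete lines only)."""
        with open(self.path, "rb") as fh:
            fh.seek(self._offset)
            chunk = fh.read()
        end = chunk.rfind(b"\n")
        if end < 0:
            return []  # nothing new, or the filler is still mid-write
        self._offset += end + 1
        return chunk[:end].decode("utf-8").split("\n")


# Simulate a filler that appends one row per integration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scan.index")
    with open(path, "wb") as fh:
        fh.write(b"1 W3OH int=1\n")
    follower = IndexFollower(path)
    first = follower.poll()
    with open(path, "ab") as fh:
        fh.write(b"1 W3OH int=2\n1 W3OH in")  # last row is incomplete
    second = follower.poll()
    third = follower.poll()

print(first, second, third)
```

The partially written row is left for a later poll, so the reader never sees half an integration.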

7
User Interface
  • Command line
  • Familiar syntax better than a good syntax
  • Procedural with byte-wise compiling (performance)
  • History, min-match or command completion
  • Useful error messages
  • Interruptible
  • Error trapping and exception handling
  • Ability to Undo
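
Min-match lets the user type the shortest unambiguous prefix of a command. A minimal sketch, assuming lowercase command names; `resolve_command`, `AmbiguousCommand`, and the command list are made up for illustration:

```python
from typing import Sequence


class AmbiguousCommand(ValueError):
    """Raised when a prefix matches more than one command."""


def resolve_command(typed: str, commands: Sequence[str]) -> str:
    """Resolve a typed prefix to the single command it abbreviates."""
    typed = typed.lower()
    if typed in commands:
        return typed  # an exact match always wins
    matches = sorted(c for c in commands if c.startswith(typed))
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"unknown command: {typed!r}")
    # A useful error message lists the candidates.
    raise AmbiguousCommand(f"{typed!r} is ambiguous: {', '.join(matches)}")


COMMANDS = ("baseline", "bshape", "get", "show", "smooth")

print(resolve_command("ba", COMMANDS))   # baseline
print(resolve_command("sh", COMMANDS))   # show
try:
    resolve_command("s", COMMANDS)
except AmbiguousCommand as err:
    print(err)
```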

8
User Interface
  • GUIs best for
  • Interacting with data visualizations
  • Filling in forms
  • data base queries
  • options for data pipelines
  • Browsing for data files
  • Defining E2E data flow (à la LabVIEW)

9
Imaging Tools
  • Visualization
  • Shouldn't try to recreate those things already
    available in another package; export instead.
  • Data flagging: pick a system that works
  • Graphics
  • Traditional capabilities (zoom in/out, scroll,
    print, save, ...)
  • Data volume requires great performance, smart
    libraries (screen resolution << data pts)
  • Interactive feedback (e.g., defining baseline
    regions).
  • Publishable plots or export into something else?
  • Default plot style
  • Ability to tweak everything (label formats; char
    sizes; add, remove, move annotation; tick mark
    size; major/minor ticks; full box; grid; multiple
    X and Y axes; ...)
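
When there are far more data points than screen columns, one common approach (not necessarily what the author had in mind) is to draw only the minimum and maximum of the points that fall in each column, so narrow features stay visible. A minimal sketch; `minmax_decimate` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def minmax_decimate(values: Sequence[float],
                    n_pixels: int) -> List[Tuple[float, float]]:
    """Reduce values to at most n_pixels (min, max) pairs, one per column."""
    if n_pixels <= 0:
        raise ValueError("n_pixels must be positive")
    n = len(values)
    if n == 0:
        return []
    n_bins = min(n_pixels, n)  # never more bins than points
    envelope = []
    for i in range(n_bins):
        lo = i * n // n_bins
        hi = (i + 1) * n // n_bins  # hi > lo because n_bins <= n
        chunk = values[lo:hi]
        envelope.append((min(chunk), max(chunk)))
    return envelope


# 1000 channels with one narrow spike, drawn on a 10-column "screen".
spectrum = [float(i) for i in range(1000)]
spectrum[500] = 1.0e6
envelope = minmax_decimate(spectrum, 10)
print(len(envelope), envelope[5])
```

Plain subsampling (every 100th point) would miss the spike at channel 500; the min/max envelope keeps it.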

10
Analysis Algorithms
  • Algorithms well documented
  • Study what exists in other packages.
  • Robustness very important but so is speed
  • Provide less robust but faster alternatives
  • Developers should not force an algorithm on users
  • Developers should provide defaults only
  • Building blocks better than a do-all algorithm.
  • Ability to use and modify header information as
    well as data.
  • E2E: do-alls are built out of the same building
    blocks.
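
The ideas on this slide (small building blocks, a default the user can override, a robust option alongside a cheaper one) can be sketched as follows. All function names here are hypothetical; the median stands in for the robust estimator and the plain mean for the cheaper, less robust one.

```python
from statistics import median
from typing import Callable, List, Sequence

Spectrum = List[float]


def fast_mean(values: Sequence[float]) -> float:
    """Cheap estimator; biased by strong lines or interference."""
    return sum(values) / len(values)


def subtract_offset(spectrum: Sequence[float],
                    estimator: Callable[[Sequence[float]], float] = median
                    ) -> Spectrum:
    """Building block: remove a constant level (robust median by default)."""
    level = estimator(spectrum)
    return [v - level for v in spectrum]


def boxcar_smooth(spectrum: Sequence[float], width: int = 3) -> Spectrum:
    """Building block: running mean, with shorter windows at the edges."""
    half = width // 2
    out = []
    for i in range(len(spectrum)):
        window = spectrum[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def pipeline(*steps: Callable[[Spectrum], Spectrum]
             ) -> Callable[[Spectrum], Spectrum]:
    """Compose building blocks into a 'do-all' procedure."""
    def run(spectrum: Spectrum) -> Spectrum:
        for step in steps:
            spectrum = step(spectrum)
        return spectrum
    return run


spectrum = [1.0, 1.0, 1.0, 101.0, 1.0, 1.0, 1.0]
robust = subtract_offset(spectrum)                       # default
fast = subtract_offset(spectrum, estimator=fast_mean)    # user's choice
do_all = pipeline(subtract_offset, boxcar_smooth)
result = do_all(spectrum)
print(robust[3], round(fast[3], 2), round(result[3], 2))
```

The developer supplies the default (`median`); the user can swap in another estimator or rearrange the blocks without touching the do-all.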

11
Documentation
  • On-line and hardcopy
  • Tutorials/Quick Guides
  • Cookbook
  • Based on observing types
  • Reference Manuals
  • Full, gory details
  • Data Formats
  • Algorithms
  • Searchable by keywords
  • Quick, interactive command help from within the
    system.
  • Never release until these are in place

12
User Support/Feedback
  • A familiar system minimizes staff support
  • Easily accessed, on-line help desk and
    Suggestion box
  • Automatic generation of bug reports
  • Observers of observers

13
Marketing
  • A familiar system already has a market
  • Don't be another cereal on the supermarket shelf
  • Workshops are better than papers
  • Create a User Community
  • Responsive feedback from developers
  • Independent Beta testers
  • Reputation: first experiences are everything

14
User Community
  • User Forums
  • Newsletters
  • Accept User Contributions/Additions
  • Sourceforge-like system
  • NRAO-seal-of-approval
  • NRAO Moderator

15
Real-Time Data Display
  • To guarantee data quality
  • Product is not stored (except for hardcopy)
  • Sequential processing -- different from E2E/Data
    pipeline
  • Fast is more important than accurate
  • Few bells and whistles -- must avoid the RTD
    black hole
  • A simple display for all observation types more
    important than sophisticated displays for a few
    data types
  • Display happens within an integration of when
    data were taken; tied to real-time filler
  • GUI based; underlying language is unimportant
  • Output understandable by an operator

16
Real Time Data Analysis
  • Pointing/Focus/Tipping/... are different from RTD
  • Results should be stored (Data Base)
  • Results are used by the control system
    (pointing/focus) or by subsequent analysis
    (tipping)
  • Accuracy is as important as speed
  • More bells, whistles, user-options
  • Sequential processing (non E2E/data pipeline)
  • Only a few observation types are handled
  • Analysis happens within an integration of when
    data were taken
  • GUI based; underlying language is unimportant
  • Output understandable by an operator

17
IDL Work Package
  • SDFITS
  • Interim solution for data import/export
  • Class/IDL specific; soon Aips/Aips++/UniPOPS?
  • MD/BDFITS next generation (keywords,
    incompleteness of contents, versatility, ...)
  • IDL: Tom Bania
  • Uses UniPOPS as a model; familiar to many
  • Very good reproduction
  • Bania-centric; needs to be generalized

18
IDL Work Package
  • Glen Langston
  • Assess whether IDL will meet performance,
    extensibility, usability, ... goals.
  • Generalization to other observing types.
  • Real-Time data access and display
  • Developed on top of and in parallel with Tom's
    work (so, implementations have diverged)
  • Works well for Glen's own experiments

19
IDL Work Package
  • Institutionalize what Tom and Glen have done
  • Code management
  • Code review
  • Combine Tom's and Glen's branches
  • Generalize code
  • Provide ways for Tom and Glen to contribute
    within the same revision-control branch.
  • Develop Institutionalized code
  • Improve performance, usability, maintenance
  • Add/Replace I/O components with better CS
    methods.

20
Calibration Work Package
  • User-tunable algorithms
  • Options for the real-time filler: sequential
  • Options for E2E pipeline: non-sequential
  • Options for interactive data reduction
  • Default algorithms for all observing cases
  • Extensible as new algorithms are developed
  • User-defined/tweaked algorithms
  • Robust and not-so-robust algorithms

21
Calibration Work Package
  • Opacity/atmosphere model
  • Output units
  • Efficiencies
  • Source size
  • Telescope model
  • Tsys(f) estimates
  • Differencing schemes
  • Non-linearities/template fitting/...
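
As an illustration of one differencing scheme, the sketch below uses the textbook position-switched form Ta = Tsys × (sig − ref) / ref. The slides do not spell out any formula, so this choice, the function name `calibrate_on_off`, and the use of a single scalar Tsys (rather than a per-channel Tsys(f)) are all assumptions.

```python
from typing import List, Sequence


def calibrate_on_off(sig: Sequence[float],
                     ref: Sequence[float],
                     tsys: float) -> List[float]:
    """Antenna temperature per channel: Tsys * (sig - ref) / ref.

    sig  -- raw counts on source, one value per channel
    ref  -- raw counts off source (must be non-zero in every channel)
    tsys -- system temperature in kelvin (scalar here by assumption)
    """
    if len(sig) != len(ref):
        raise ValueError("sig and ref must have the same number of channels")
    return [tsys * (s - r) / r for s, r in zip(sig, ref)]


# Made-up counts: three channels, Tsys = 20 K.
sig = [105.0, 110.0, 100.0]
ref = [100.0, 100.0, 100.0]
ta = calibrate_on_off(sig, ref, tsys=20.0)
print(ta)  # [1.0, 2.0, 0.0]
```

A user-tunable version would let the observer swap in a different differencing scheme or a frequency-dependent Tsys without changing the calling code.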