BUILDING NANOBANK - PowerPoint PPT Presentation

Title: BUILDING NANOBANK


1
BUILDING NANOBANK
  • Data Structure and Selection Criteria
  • Jason Fong and Emre Uyar
  • University of California, Los Angeles

2
What is Nanobank?
  • Nanobank is a collection of observations from
    various sources (scientific articles, patents and
    government grants) determined to be related to
    the field of nanotechnology, either by
    probabilistic information retrieval (IR) methods
    or by being declared nano by a source authority.

3
Data Sources - Articles
  • 580,711 scientific articles from peer-reviewed
    journals.
  • Source: Science Citation Index, Arts & Humanities
    Citation Index and Social Sciences Citation Index
    of the Institute for Scientific Information, Inc.
    (ISI). Together, these indexes contain more
    than 24,250,000 entries from over 8,700
    peer-reviewed scientific journals.

4
Data Sources - Patents and Grants
  • 240,437 patents from the U.S. Patent and Trademark
    Office's online database of more than 4,000,000
    patents, granted by the USPTO from 1976 to 2006.
  • 52,831 grants from NIH and NSF databases.

5
Data Contents
  • Articles
      • Titles
      • Journal volume and issue numbers
      • Publication years
      • Author names
      • Names and addresses of organizations
        affiliated with authors

6
Data Contents
  • Patents
      • Titles and abstracts
      • Application and grant dates
      • Names and addresses of inventors and assignees
      • U.S. and international patent classifications

7
Data Contents
  • Grants
      • Titles and abstracts
      • Receiving organization names and addresses
      • PI and co-PI names
      • Grant amounts

8
Nanobank Data Structure
  • Internal database
      • Stored in a relational database
      • Separate tables for various data items
      • ID numbers for each item link the tables
  • Version posted on Nanobank.org
      • Denormalized form of the internal database
      • Storing redundant data isn't as space-efficient,
        but lessens the need to join multiple tables
  • Nanobank Codebook contains detailed information
    on tables and fields available in each
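
The relationship between the internal (normalized) database and the posted (denormalized) version can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table and column names (article, author, article_author, article_flat) are hypothetical and are not taken from the Nanobank Codebook.

```python
import sqlite3

# Hypothetical schema for illustration; real table and field names
# are documented in the Nanobank Codebook.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (article_id INTEGER PRIMARY KEY, title TEXT, pub_year INTEGER);
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_author (article_id INTEGER, author_id INTEGER);
    INSERT INTO article VALUES (1, 'Example nano article', 2001);
    INSERT INTO author VALUES (10, 'A. Author');
    INSERT INTO author VALUES (11, 'B. Author');
    INSERT INTO article_author VALUES (1, 10);
    INSERT INTO article_author VALUES (1, 11);
""")

# Denormalized form: one row per article-author pair. The article fields
# are repeated on every row, so users do not need to join tables.
conn.execute("""
    CREATE TABLE article_flat AS
    SELECT a.article_id, a.title, a.pub_year, au.name AS author_name
    FROM article a
    JOIN article_author aa ON aa.article_id = a.article_id
    JOIN author au ON au.author_id = aa.author_id
""")

flat_rows = conn.execute(
    "SELECT article_id, title, pub_year, author_name "
    "FROM article_flat ORDER BY author_name"
).fetchall()
```

The title and year are stored twice in article_flat, which is the space cost the slide mentions; in exchange, a single-table query answers questions that would otherwise need two joins.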

9
Document Selection
  • Document Selection Methods
      • Keywords
      • Probabilistic
      • Authority-selected
  • Tables include fields to indicate the selection
    method
      • nanobank_flag: 1 if selected by Keywords or
        Probabilistic, 0 otherwise
      • authority_flag: 1 if Authority-selected, 0
        otherwise

10
Document Selection - Keywords
  • Search for text patterns matching words or
    phrases related to nanotechnology
  • Words and phrases chosen by subject specialists
  • Less effective for identifying very early or
    recent documents
      • Early documents were written before the terms
        were in common usage
      • Recent documents have terms that are too new
        to be included in the search patterns
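
A keyword search of this kind can be sketched with regular expressions. The patterns below are invented for illustration and are not the list chosen by the subject specialists; the function name is also an assumption.

```python
import re

# Hypothetical patterns; the actual words and phrases were chosen by
# subject specialists and are not reproduced here.
PATTERNS = [
    r"\bnano\w*",
    r"\bquantum dots?\b",
    r"\batomic force microscop\w*",
]
KEYWORD_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def matches_keywords(text):
    """Return True if any keyword pattern occurs in the text."""
    return KEYWORD_RE.search(text) is not None


# A document that uses older vocabulary is missed, which is the
# weakness for early documents noted above.
early_doc_found = matches_keywords("Ultrafine metal particles")
```

A broad pattern such as the first one also matches unrelated words like "nanosecond", so a practical pattern list would likely need exclusions as well.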

11
Document Selection - Probabilistic
  • Incorporates new terms as they come into common
    usage
  • Uses the Xapian search engine library to perform
    ranking calculations
  • Analyzes document text and ranks against a set of
    query terms

12
Document Selection - Probabilistic
  • Initial query terms from the Virtual Journal of
    Nanoscale Science & Technology (VJNano)
      • All articles in VJNano assumed to be relevant
      • Select highest ranked terms
  • Document selection process
      • Use initial query terms to select relevant
        documents from all journal articles
      • Select additional terms from those relevant
        documents and add to query
      • Repeat selection with expanded query terms
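
The select, expand, and repeat loop can be sketched in plain Python. This is not the Xapian-based ranking used for Nanobank: the term score here (share of relevant documents containing a term minus its share in the whole corpus) is a simple stand-in, and the function names, the toy corpus, and the seed documents are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def top_terms(relevant_docs, all_docs, k, exclude=frozenset()):
    """Pick up to k terms most over-represented in relevant_docs."""
    rel = Counter(t for d in relevant_docs for t in set(tokenize(d)))
    bg = Counter(t for d in all_docs for t in set(tokenize(d)))
    scores = {
        t: rel[t] / len(relevant_docs) - bg[t] / len(all_docs)
        for t in rel
        if t not in exclude
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked[:k] if scores[t] > 0]


def select(docs, query):
    """Keep documents that contain at least one query term."""
    return [d for d in docs if query & set(tokenize(d))]


def expand_and_select(seed_docs, corpus, k=1, rounds=1):
    query = set(top_terms(seed_docs, corpus, k))   # initial query terms
    selected = select(corpus, query)               # first selection
    for _ in range(rounds):
        if not selected:
            break
        new_terms = top_terms(selected, corpus, k, exclude=query)
        if not new_terms:
            break
        query |= set(new_terms)                    # expand the query
        selected = select(corpus, query)           # repeat the selection
    return query, selected


# Toy data: SEED plays the role of the articles assumed to be relevant.
SEED = ["nanotube transistor", "nanotube growth"]
CORPUS = [
    "nanotube transistor", "nanotube growth", "nanotube nanowire array",
    "nanowire nanotube sensor", "nanotube nanowire device",
    "nanowire laser", "protein folding", "market growth", "labor market",
]
query, selected = expand_and_select(SEED, CORPUS, k=1, rounds=1)
```

In this toy run the seed yields the term "nanotube"; the documents it selects then contribute "nanowire", which brings in "nanowire laser", a document the initial query would have missed.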

13
Document Selection - Authority Set
  • Articles
      • Listed in the Virtual Journal of Nanoscale
        Science & Technology
  • Patents
      • Listed under United States Patent
        Classification Class 977 (Nanotechnology)
  • NSF Grants
      • Program name contains "nano"
  • NIH Grants
      • NIH descriptive tag contains "nano"
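
These rules translate into a small function that sets authority_flag. The record layout (a dict with keys such as kind, in_vjnano, uspc_classes, program_name and descriptive_tag) is hypothetical; only the rules themselves come from the slide.

```python
def authority_flag(record):
    """Return 1 if a source authority marks the record as nano, else 0.

    The record keys used here are assumptions for illustration.
    """
    kind = record["kind"]
    if kind == "article":
        # Listed in the Virtual Journal of Nanoscale Science & Technology
        return int(bool(record.get("in_vjnano", False)))
    if kind == "patent":
        # Listed under USPC Class 977 (Nanotechnology)
        return int("977" in record.get("uspc_classes", ()))
    if kind == "nsf_grant":
        return int("nano" in record.get("program_name", "").lower())
    if kind == "nih_grant":
        return int("nano" in record.get("descriptive_tag", "").lower())
    return 0
```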

14
GEOCODING
  • Standardizing across the differing naming
    conventions used in different sources.
  • Resolving non-uniformity in how observations
    are recorded.
  • Correcting common mistakes.
  • For US observations: providing grouping units
    (other than city and state) that are not
    available in the original data sources, such as
    counties and BEA areas.

15
COUNTRY GEOCODING
  • Country names in all observations are cleaned,
    standardized and assigned an ISO code (2-letter
    alphabetic)
  • The current ISO list of countries is taken as the
    basis; historical entries are assigned to the
    closest current country to the extent available.
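
Country cleaning of this kind amounts to normalizing the raw string and looking it up in a table. The lookup entries below are a tiny illustrative sample, and the choices for historical names (USSR to RU, West Germany to DE) are assumptions about what "closest current country" means, not the actual Nanobank assignments.

```python
import re

# Small illustrative lookup; the historical mappings are assumptions.
ISO_BY_NAME = {
    "united states": "US", "usa": "US",
    "germany": "DE", "west germany": "DE",
    "russia": "RU", "ussr": "RU",
    "united kingdom": "GB", "england": "GB",
}


def country_code(raw_name):
    """Clean a raw country string and return its 2-letter ISO code, or None."""
    key = raw_name.lower().replace(".", "")          # "U.S.A." -> "usa"
    key = re.sub(r"[^a-z]+", " ", key).strip()       # drop other punctuation
    return ISO_BY_NAME.get(key)
```

Names that are not in the table return None, which leaves them available for hand cleaning rather than guessing.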

16
US GEOCODING
  • US observations are those in the 50 US states, DC
    and 7 US associated areas.
  • Cities, states, counties and BEA economic areas
    are coded using Populated Places data obtained
    from the FIPS 55 database and BEA.
  • The basis is the city-state combination. City
    names are standardized and matched to the names
    in the FIPS database on a state-by-state basis.
  • In articles, 99.98% of US observations have been
    assigned a definite city-state code.

17
US GEOCODING - Variables Created
  • Standard_city_name: Standardized name as it
    appears in the FIPS database (corrected for
    misspellings, abbreviations, etc.)
  • State_code: 2-digit numeric code.
  • City_code: 5-digit numeric code, unique within a
    state.
  • County_code: 5-digit numeric code.
  • County_name
  • City code and state code together uniquely
    determine a populated place.
  • Numeric codes are the same as the codes used by
    FIPS.
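
The city-state matching and the variables it produces can be sketched as a lookup keyed on (state code, standardized city name). All codes and place names below are made-up placeholders, not real FIPS values, and the abbreviation and misspelling tables are tiny hypothetical samples.

```python
# Placeholder rows: the codes are invented for illustration and are
# NOT real FIPS values.
# (state_code, standardized city name) -> (city_code, county_code, county_name)
PLACES = {
    ("91", "SPRINGFIELD"): ("00010", "91001", "EXAMPLE COUNTY A"),
    ("92", "SPRINGFIELD"): ("00010", "92003", "EXAMPLE COUNTY B"),
    ("91", "SAINT EXAMPLE"): ("00020", "91001", "EXAMPLE COUNTY A"),
}
ABBREVIATIONS = {"ST": "SAINT", "FT": "FORT", "MT": "MOUNT"}
MISSPELLINGS = {"SPRINGFEILD": "SPRINGFIELD"}


def standardize_city(raw_city):
    """Upper-case, expand abbreviations, and fix known misspellings."""
    words = raw_city.upper().replace(".", "").split()
    name = " ".join(ABBREVIATIONS.get(w, w) for w in words)
    return MISSPELLINGS.get(name, name)


def geocode(raw_city, state_code):
    """Match a city within its state and return the created variables."""
    name = standardize_city(raw_city)
    hit = PLACES.get((state_code, name))
    if hit is None:
        return None  # no definite city-state code could be assigned
    city_code, county_code, county_name = hit
    return {
        "Standard_city_name": name,
        "State_code": state_code,
        "City_code": city_code,
        "County_code": county_code,
        "County_name": county_name,
    }
```

Because the city code is only unique within a state, the two placeholder Springfields share a city code and are told apart by the state code in the key.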

18
GEOCODING - US BEA Areas
  • The Bureau of Economic Analysis (BEA) created 179
    Economic Areas in the US, assigning each county
    to a unique BEA area.
  • BEA_code: 3-digit numeric code that identifies
    the associated BEA Economic Area for each
    observation.
  • "BEA's economic areas define the relevant
    regional markets surrounding metropolitan or
    micropolitan statistical areas. They consist of
    one or more economic nodes - metropolitan or
    micropolitan statistical areas that serve as
    regional centers of economic activity and the
    surrounding counties that are economically
    related to the nodes.
  • The economic areas were redefined on November 17,
    2004, and are based on commuting data from the
    2000 decennial population census, on redefined
    statistical areas from OMB (February 2004), and
    on newspaper circulation data from the Audit
    Bureau of Circulations for 2001."
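
Since each county belongs to exactly one economic area, the BEA code for an observation follows from its county code through a many-to-one lookup. The sketch below uses invented county and BEA codes; it only illustrates how observations could be regrouped by BEA area.

```python
from collections import Counter

# Invented codes for illustration only (not real county or BEA codes).
BEA_BY_COUNTY = {"91001": "901", "91003": "901", "92003": "902"}


def bea_code(county_code):
    """Return the BEA area code for a county, or None if unknown."""
    return BEA_BY_COUNTY.get(county_code)


def observations_per_bea(county_codes):
    """Count observations in each BEA area, skipping unmatched counties."""
    codes = (bea_code(c) for c in county_codes)
    return Counter(code for code in codes if code is not None)


counts = observations_per_bea(["91001", "91003", "92003", "91001", "00000"])
```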

19
ORGANIZATION CODES
  • Each observation is assigned an alphanumeric
    code.
  • The 2-letter alphabetic part identifies the
    organization type.
  • The numeric part groups names that are the same
    after standardization and hand cleaning.

20
Organization Codes - Types of Cleaning
  • Standardization of common identifiers
      • IBM, IBM Corp., IBM Corporation
      • Univ, University, University of, Universidade,
        Universidad, Univerzitet, Universita,
        Universitat, Universiti, Universite,
        Universitet, Universiteit
  • Using lookup tables and hand cleaning to
    identify common variants (and misspellings) of
    names used by the same organization
      • IBM, Int Buisness Machines
      • International Business Machines Corporation
      • Int Business Machines Operation
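
The two kinds of cleaning, followed by code assignment, can be sketched as below. The identifier and variant tables are small samples built from the examples on this slide, and the type prefix "FI" and the 5-digit numbering are hypothetical; the actual Nanobank type codes are not given here.

```python
import re

# Step 1 table: common identifiers mapped to one standard form
# ("" means the identifier is dropped).
IDENTIFIERS = {
    "corp": "", "corporation": "",
    "int": "international",
    "univ": "university", "universidad": "university",
    "universite": "university", "universitat": "university",
}
# Step 2 table: hand-built lookup of variants and misspellings.
VARIANTS = {
    "international buisness machines": "ibm",
    "international business machines": "ibm",
    "international business machines operation": "ibm",
}


def standardize(raw_name):
    """Apply identifier standardization, then the variant lookup."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", raw_name.lower()).split()
    tokens = [IDENTIFIERS.get(t, t) for t in tokens]
    key = " ".join(t for t in tokens if t)
    return VARIANTS.get(key, key)


def assign_codes(raw_names, type_prefix):
    """Give names that standardize to the same key the same code."""
    numbers = {}
    codes = {}
    for raw in raw_names:
        key = standardize(raw)
        if key not in numbers:
            numbers[key] = len(numbers) + 1
        # Hypothetical layout: 2-letter type prefix + 5-digit number.
        codes[raw] = f"{type_prefix}{numbers[key]:05d}"
    return codes


NAMES = [
    "IBM", "IBM Corp.", "IBM Corporation", "Int Buisness Machines",
    "International Business Machines Corporation",
    "Int Business Machines Operation", "Example Widgets",
]
codes = assign_codes(NAMES, "FI")
```

All six IBM spellings end up with one code, while the unrelated placeholder name receives the next number.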