Title: Mining the Deep Web for Economic Data
1Mining the Deep Web for Economic Data
- Joe Hellerstein
- Hal R. Varian
- UC Berkeley
- http://www.sims.berkeley.edu/hal
- http://www.cs.berkeley.edu/jmh
2The Deep Web
- The deep Web (databases) is about 400 times as large as the surface Web
- There is lots of interesting data there, if it can be harvested
- Examples
- FFF
- Amazon
- Monster
- Traffic
3Federated Facts and Figures
- http://fff.cs.berkeley.edu
- Mine political data
- Collect contributions to the Democratic and Republican parties
- Then cross-tab this with other data
- Yahoo celebrities list
- Geographic information system
- Value of real estate in donors' neighborhoods
- Clinton pardons list
4Celebrity donors
5Distribution of donors
6Mining Economic Data
- Private sector applications
- Competitive intelligence of various sorts
- Public sector applications
- Economic forecasting
- Labor market histories
- And more.
7Private Sector Applications
- SIMS final projects
- Competitors
- Media Map
- Footprint (a hack)
- Book sales example
- Courtesy of Madeline Schnapp, O'Reilly & Associates
10battleground adversaries attacking catching up to fights contends opponents
competitive challenge leading market win wins losing
arch-rival compete competes competitors market share
Lists
Good
Bad
11Evaluation
Competitors
NAICS
87
Precision
16
Recall
36
11
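For readers unfamiliar with the two evaluation measures named on this slide, here is a minimal sketch of how precision and recall could be computed for a list of extracted competitors. The company names are hypothetical placeholders, not data from the talk, and the function is an illustration rather than the authors' evaluation code.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of predicted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of actual items that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical company names, for illustration only.
predicted = {"AcmeSoft", "Globex", "Initech", "Umbrella"}
actual = {"AcmeSoft", "Globex", "Hooli"}
p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four predicted names are correct and two of the three true competitors are found.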
12Footprint
- Companies have to file 10-Ks and cite all information that is materially relevant to the value of the company
- Potential bad news ends up in footnotes
- SEC rules about visual presentation
- What about computer-readable versions at EDGAR?
13Footnotes in SEC Filings
- Our idea: extract and highlight footnotes, link back to 10-K
- Add toenotes to interesting footnotes
- See results at http://www.sims.berkeley.edu/hal/footprinthal/footprint
- Deeper project: does content analysis help predict stock performance?
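The extraction step could be sketched roughly as follows. This is not the Footprint implementation: it assumes a plain-text filing in which each footnote starts a line with a marker such as "(1)" or "(a)", and the sample text is invented for illustration. The recorded line numbers are what would allow a link back into the 10-K.

```python
import re

# Assumption: a footnote starts a line with a marker such as "(1)" or "(a)".
FOOTNOTE_RE = re.compile(r"^\s*\((\d+|[a-z])\)\s+(.*\S)\s*$")


def extract_footnotes(filing_text):
    """Return a list of (line_number, marker, text) for footnote-like lines."""
    footnotes = []
    for line_number, line in enumerate(filing_text.splitlines(), start=1):
        match = FOOTNOTE_RE.match(line)
        if match:
            footnotes.append((line_number, match.group(1), match.group(2)))
    return footnotes


# Invented sample text, not taken from a real filing.
sample = """Other current liabilities    1,200    980
(1) Includes amounts owed to unconsolidated subsidiaries.
Total revenues               5,400  4,900
(a) Unaffiliated revenues include sales to unconsolidated subsidiaries.
"""

footnotes = extract_footnotes(sample)
for line_number, marker, text in footnotes:
    print(f"line {line_number} [{marker}]: {text}")
```

Real filings vary widely in layout, so the marker pattern would need tuning per document.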
14Example of Footprint
- Includes 216 million and 128 million in other current liabilities for 1997 and 1996, respectively.
- Unaffiliated revenues include sales to unconsolidated subsidiaries
15Intelligence about competitors' sales
- Courtesy of Madeline Schnapp, formerly of O'Reilly & Associates
18Average Rank 5500
19Amazon Rank Calibration 5/2001: Ranks between 1 and 2000
[Chart; labels: Amazon Rank, Units Sold]
20Amazon Rank Calibration 5/2001: Ranks between 1 and 20000
[Chart; labels: Amazon Data, Units Sold]
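One way such a calibration could be done is to fit a curve that maps sales rank to units sold, given titles for which both are known. The sketch below assumes a power-law form, log(units) = a + b * log(rank). That functional form is an assumption for illustration, not something stated in the slides, and the data points are synthetic, generated from a made-up curve.

```python
import math


def fit_power_law(ranks, units):
    """Least-squares fit of log(units) = a + b * log(rank); returns (a, b)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(u) for u in units]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_units(rank, a, b):
    """Estimated units sold at a given rank under the fitted curve."""
    return math.exp(a + b * math.log(rank))


# Synthetic calibration points from a made-up curve (5000 * rank ** -0.8).
ranks = [1, 10, 100, 1000, 20000]
units = [5000 * r ** -0.8 for r in ranks]

a, b = fit_power_law(ranks, units)
print(f"fitted exponent b = {b:.3f}")
print(f"estimated units at rank 5500: {predict_units(5500, a, b):.2f}")
```

With real data the points would scatter around the line, and the fit quality over different rank ranges (1 to 2000 versus 1 to 20000) would be worth checking separately.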
21How Well Does it Work?
22ADDISON WESLEY TITLES
23CALCULATED WEEKLY SALES
[Chart; labels: CALCULATED WEEKLY SALES, PUBLISHER]
25Monster.com
- Help wanted
- By city
- By occupation
- Jobs wanted
- Resume generator
- Salary aspirations
- Better than help-wanted ads alone, since it also has jobs-wanted ads
26Talentmarket
27Resume data
- How many job changes are optimal?
- What is the role of big regional or industry employers in job history?
- Do immigrants accept lower wages?
- Regional dynamics (e.g., Silicon Valley, Houston)
- How long does it take before job seekers check "willing to relocate"?
28Housing market at Craigslist
29Traffic monitoring (PATH)
Decades of data have been collected by Caltrans and others
30Database of traffic conditions
31Is Traffic a (leading, lagging) Indicator of Regional Economic Activity?
- Econometric issues: frequency of series, weekend/weekday
- Correlate with help wanted/job wanted and apartment/housing prices
- Split out trucks and cars
- Maybe in future split out full trucks and empty trucks
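A simple first look at the leading/lagging question is to correlate the traffic series with an activity series at several time shifts and see where the correlation peaks. The sketch below uses plain Pearson correlation on synthetic series in which activity is, by construction, traffic delayed by two periods. The series, the lag range, and the function names are illustrative assumptions, and real data would first need the frequency and weekend/weekday adjustments listed above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(traffic, activity, lag):
    """Correlate traffic[t] with activity[t + lag].

    A peak at a positive lag suggests traffic leads activity;
    a peak at a negative lag suggests traffic lags it.
    """
    if lag > 0:
        return pearson(traffic[:-lag], activity[lag:])
    if lag < 0:
        return pearson(traffic[-lag:], activity[:lag])
    return pearson(traffic, activity)


# Synthetic series: activity is traffic delayed by two periods.
traffic = [math.sin(0.3 * t) + 0.1 * t for t in range(40)]
activity = [0.0, 0.0] + traffic[:-2]

best_lag = max(range(-4, 5),
               key=lambda k: lagged_correlation(traffic, activity, k))
print(f"correlation peaks at lag {best_lag}")
```

Both synthetic series share a trend, so correlations are high at every lag. With real data, detrending or differencing before this step would make the peak more informative.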