Basic Data Statistics in Python, Part 1: Convenient Data Acquisition, Data Preparation and Cleanup, Data Display

1. Convenient data acquisition

  1.1 Local data: opening, reading/writing, and closing files (covered in a separate chapter)

  1.2 Network data:

    1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request, http.client)

      Regular expressions (covered in a separate chapter)
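
As a minimal sketch of the Python 3 route named in 1.2.1, the helper below fetches a URL with urllib.request and decodes the body. It is written for Python 3, whereas the sessions in these notes are Python 2. The function name fetch_text, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper for illustration)."""
    with urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in the response headers; fall back to UTF-8 (an assumption).
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)


# Example call (needs network access; the URL is only a placeholder):
# html = fetch_text("https://example.com/")
```

The returned string can then be searched with regular expressions or handed to a parser.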

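    1.2.2 Fetching Yahoo Finance data with the matplotlib.finance module (deprecated in matplotlib 2.0 and removed in later releases, so the session below needs an older matplotlib)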

In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl

In [8]: from datetime import date

In [9]: from datetime import datetime

In [10]: import pandas as pd

In [11]: today = date.today()

In [12]: start = (today.year-1, today.month, today.day)

In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)  # fetch one year of AXP quotes

In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']

In [16]: list1 = []

In [18]: for i in range(0,len(quotes)):
    ...:     x = date.fromordinal(int(quotes[i][0]))  # the first column of each row is an ordinal; convert it to a date
    ...:     y = x.strftime('%Y-%m-%d')  # format the date as YYYY-MM-DD
    ...:     list1.append(y)  # collect the formatted dates
    ...:     

In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)  # date strings become the index, fields become the column names

In [20]: quotesdf = quotesdf.drop(['date'],axis=1)  # drop the now-redundant date column

In [21]: print quotesdf
                 open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
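
Because the Yahoo download may not be reproducible today, here is a minimal offline sketch of the same conversion, written for Python 3. The three rows are copied from the session output in these notes. The name labels and the list comprehension that replaces the explicit loop are illustrative choices.

```python
from datetime import date

import pandas as pd

# Three rows copied from the session output: (ordinal date, open, close, high, low, volume).
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

# Convert each ordinal in the first column to a 'YYYY-MM-DD' string in one pass.
labels = [date.fromordinal(int(row[0])).strftime('%Y-%m-%d') for row in quotes]

# Use the date strings as the index, then drop the numeric date column.
quotesdf = pd.DataFrame(quotes, index=labels, columns=fields).drop(['date'], axis=1)
print(quotesdf)
```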

    1.2.3 Fetching corpora and other data with the Natural Language Toolkit (NLTK)

      1. Install nltk: pip install nltk

      2. Download a corpus:

In [1]: import nltk

In [2]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Package gutenberg is already up-to-date!

      3. Load the data:

In [3]: from nltk.corpus import gutenberg

In [4]: print gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')

In [6]: texts
Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]

2. Data preparation and cleanup

  2.1 Adding column names to the quotes data

In [79]: quotesdf = pd.DataFrame(quotes)

In [80]: quotesdf
Out[80]: 
            0          1          2          3          4           5
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0

[253 rows x 6 columns]

In [81]: fields = ['date','open','close','high','low','volume']

In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)  # set the column names

In [83]: quotesdf
Out[83]: 
         date       open      close       high        low      volume
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3    735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0

  2.2 Adding index labels to the quotes data

In [84]: quotesdf
Out[84]: 
         date       open      close       high        low      volume
0    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0

[253 rows x 6 columns]

In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)  # change the index from 0,1,2... to 1,2,3...

In [86]: quotesdf
Out[86]: 
         date       open      close       high        low      volume
1    735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
2    735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
3    735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
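
Sections 2.1 and 2.2 rebuild the DataFrame each time. As an alternative sketch, written for Python 3, the labels of an existing frame can also be assigned in place. The two rows are copied from the output in these notes.

```python
import pandas as pd

# Two rows copied from the session output.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
]

quotesdf = pd.DataFrame(quotes)  # default labels 0, 1, 2, ... on both axes

# Assign new labels without rebuilding the frame.
quotesdf.columns = ['date', 'open', 'close', 'high', 'low', 'volume']
quotesdf.index = range(1, len(quotesdf) + 1)
print(quotesdf)
```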

  2.3 Date conversion: Gregorian ordinal => ordinary date string

In [88]: from datetime import date

In [89]: firstday = date.fromordinal(735190)

In [93]: firstday
Out[93]: datetime.date(2013, 11, 18)

In [95]: firstday = firstday.strftime('%Y-%m-%d')

In [96]: firstday
Out[96]: '2013-11-18'
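
The conversion can be checked in both directions with the standard library alone. This small sketch reuses the ordinal from the session; the variable names are illustrative.

```python
from datetime import date

ordinal = 735190                   # value used in the session above
day = date.fromordinal(ordinal)    # ordinal 1 corresponds to 0001-01-01
text = day.strftime('%Y-%m-%d')    # for date objects this matches day.isoformat()

print(day, text)
print(day.toordinal() == ordinal)  # toordinal() reverses fromordinal()
```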

  2.4 Creating a time series:

In [120]: import pandas as pd

In [121]: dates = pd.date_range('20170101', periods=7)  # build a date sequence from a start date and a length

In [122]: dates
Out[122]: 
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04','2017-01-05', '2017-01-06', '2017-01-07'],dtype='datetime64[ns]', freq='D')

In [123]: import numpy as np

In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))  # dates become the index, A/B/C the column names, filled with 7x3 random numbers

In [125]: dates
Out[125]: 
                   A         B         C
2017-01-01  0.705927  0.311453  1.455362
2017-01-02 -0.331531 -0.358449  0.175375
2017-01-03 -0.284583 -1.760700 -0.582880
2017-01-04 -0.759392 -2.080658 -2.015328
2017-01-05 -0.517370  0.906072 -0.106568
2017-01-06 -0.252802 -2.135604 -0.692153
2017-01-07 -0.275184  0.142973 -1.262126
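
One benefit of a date index is that rows can be selected by date. The sketch below, written for Python 3, assumes a seeded generator so the run is repeatable. The names frame and middle are illustrative, and frame is used instead of reassigning dates.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2017-01-01', periods=7)  # daily frequency by default
rng = np.random.default_rng(0)                  # fixed seed, chosen arbitrarily
frame = pd.DataFrame(rng.standard_normal((7, 3)), index=dates, columns=list('ABC'))

# Label-based slicing on a DatetimeIndex accepts date strings; both ends are included.
middle = frame.loc['2017-01-03':'2017-01-05']
print(middle)
```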

  2.5 Exercises

In [101]: datetime.now()  # show the current date and time
Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
=========================================
In [108]: datetime.now().month  # show the current month
Out[108]: 1

=========================================
In [126]: import pandas as pd

In [127]: dates = pd.date_range('2015-02-01',periods=10)

In [128]: dates
Out[128]: 
DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04','2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08','2015-02-09', '2015-02-10'],dtype='datetime64[ns]', freq='D')

In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])

In [134]: res
Out[134]: 
            value
2015-02-01      1
2015-02-02      2
2015-02-03      3
2015-02-04      4
2015-02-05      5
2015-02-06      6
2015-02-07      7
2015-02-08      8
2015-02-09      9
2015-02-10     10

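3. Data display

  3.1 Ways to display (quotesdf2 is not defined in these notes; it is presumably the date-indexed DataFrame built in 1.2.2):

In [180]: quotesdf2.index  # show the index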
Out[180]: 
Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
       ...
       u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
       u'2017-01-18', u'2017-01-19'],
      dtype='object', length=253)

In [181]: quotesdf2.columns  # show the column names
Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')

In [182]: quotesdf2.values  # show the underlying values
Out[182]: 
array([[  6.03741455e+01,   6.18359160e+01,   6.23362562e+01,
          6.01288817e+01,   9.04380000e+06],
       ..., 
       [  7.76100010e+01,   7.66900020e+01,   7.77799990e+01,
          7.66100010e+01,   7.79110000e+06]])

In [183]: quotesdf2.describe  # without parentheses this only shows the bound method; call quotesdf2.describe() for summary statistics
Out[183]: 
<bound method DataFrame.describe of                  open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
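
To get summary statistics, describe has to be called with parentheses. A minimal sketch for Python 3 follows. The frame is a small stand-in built from two columns of the rows shown in these notes, not the full quotesdf2.

```python
import pandas as pd

# Stand-in for quotesdf2: two columns of the first three rows shown in the notes.
quotesdf2 = pd.DataFrame(
    {'open': [60.374146, 61.806486, 57.283819],
     'close': [61.835916, 61.453305, 54.016907]},
    index=['2016-01-20', '2016-01-21', '2016-01-22'],
)

summary = quotesdf2.describe()  # method call: returns count, mean, std, min, quartiles, max
print(summary)
```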

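  3.2 Index format: the u prefix in the output above marks a unicode string (Python 2 notation)

  3.3 Displaying rows:

In [193]: quotesdf.head(2)  # show the first two rows with head()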
Out[193]: 
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0

In [194]: quotesdf.tail(2)  # show the last two rows with tail()
Out[194]: 
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0

In [195]: quotesdf[:2]  # show the first two rows by slicing
Out[195]: 
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0

In [197]: quotesdf[251:]  # show the last two rows by slicing (positions 251 and 252, labels 252 and 253)
Out[197]: 
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
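
The slices above count positions, not index labels, which is why quotesdf[251:] returns the rows labelled 252 and 253. The sketch below, written for Python 3, makes the difference explicit on a made-up five-row frame whose index starts at 1. The values and names are purely illustrative.

```python
import pandas as pd

# Made-up values; the index starts at 1 like quotesdf in section 2.2.
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]}, index=range(1, 6))

first_two = df[:2]       # plain slice: positions 0 and 1, i.e. labels 1 and 2
last_two = df.iloc[-2:]  # positional, no need to know the number of rows
by_label = df.loc[4:5]   # label-based: both end labels are included

print(first_two.equals(df.head(2)))
print(last_two.equals(df.tail(2)))
print(by_label.equals(last_two))
```

Using iloc for positions and loc for labels avoids having to remember which rule a bare slice follows.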

4. Data selection

5. Simple statistics and processing

6. Grouping

7. Merge