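A simple little program: write a program that analyzes a text file containing an English article, counts how often each word occurs, and prints the 10 most frequent words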

2024-03-30 06:18

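Solving this problem only requires two core features: recognizing words and sorting the counts. The remaining details can be refined gradually. The first question I considered was how to store the words. My first thought was a linked list, because an array has to be sized up front. I figured I could define a data type that stores a word together with its occurrence count, and then sort the entries. Each word itself would be stored in a character array, so recognizing words would only require examining the input characters one by one and cutting at word boundaries. With that rough plan in mind, I then realized that storing words in fixed-size character arrays wastes space because words differ in length, so I decided to look online for similar examples and a better approach.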

Below is the code I found online:

#include<iostream>
#include<fstream>
#include<string>
using namespace std;
class danci
// Custom class that stores a word and its occurrence count
{
public:
string name;
int num;
danci(){num=0;name="";};
};
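void readfile(danci*&inchar,int &counter)
// Read the file, identify and store each word, and count occurrences
{
ifstream infile("in.txt");
if(!infile) {cout<<"cannot open!"<<endl;return;}
string temp;
while(infile>>temp) // stop as soon as a read fails, so no empty word is counted at end of file
{
int i=0;
for( ;i<counter;i++)
{
if(temp==inchar[i].name) { inchar[i].num++;break;}
}
if(i==counter) // not found: append a new entry
{
if(counter>=1000) {cout<<"too many distinct words!"<<endl;break;} // capacity of the array allocated in main
inchar[counter].name=temp;
inchar[counter].num=1;
counter++;
}
}
infile.close();
}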
void outfile(danci*inchar,int counter) // Write the results to a file
{
ofstream outfile("out.txt");
for(int i=0;i<counter;i++)
outfile<<"Word: "<<inchar[i].name<<endl<<"Occurrences: "<<inchar[i].num<<endl;
}
int main()
{
danci*inchar=new danci[1000];
int counter=0;
readfile(inchar,counter);
outfile(inchar,counter);
delete[] inchar; // release the array allocated above
return 0;
}

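The program is very concise and, if punctuation is ignored, it counts the occurrences of each word correctly. It still has a number of shortcomings. Because the words are stored in a fixed-size array, space is wasted when the input article is small, and the array size has to be changed by hand when the article is large, which is inconvenient. Also, since only spaces and line breaks separate words on input, punctuation is mistaken for part of a word, producing "words" such as "happy,". Finally, the program is case-sensitive, which in my view does not meet the requirements. So I modified the program.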
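As an aside, the punctuation and case problems described above can also be handled by normalizing each token before it is counted. The following is only a minimal sketch under stated assumptions: `normalizeToken` is a hypothetical helper that appears in neither program, and it assumes plain ASCII English text.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not in either program): keep only ASCII letters and
// fold them to lower case, so that "Happy," and "happy" compare equal.
std::string normalizeToken(const std::string& token)
{
    std::string word;
    for (unsigned char c : token) {
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        }
    }
    return word; // empty when the token contained no letters
}
```

A token read with `infile>>temp` could be passed through this helper before the lookup, skipping it when the result is empty. Note that this simple rule also removes apostrophes and hyphens, so "don't" is counted as "dont".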

The modified program:

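#include<iostream>
#include<fstream>
#include<string>
#include<cstring> // strcpy
#ifndef _WIN32
#include<strings.h>
#define _stricmp strcasecmp // POSIX counterpart of the MSVC-specific _stricmp
#endif
using namespace std;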
struct Word
// Custom data type that stores a word and its occurrence count
{
char name[30];
int num;
struct Word *next;
};
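void readfile(struct Word*&head) // Read the file, identify and store each word, and count occurrences
{
    ifstream infile("in.txt");
    if(!infile) {cout<<"cannot open!"<<endl;return;}
    infile>>noskipws; // get() never skips whitespace, so this is not strictly needed
    char a=' ',temp[30]; // a starts as a non-letter so that an empty file is handled safely
    struct Word *p;
    while(infile)
    { 
        int i=0;
        infile.get(a);
        temp[0]=' '; // marker: stays a space until the first letter of a word is stored
        while((a>='a'&&a<='z')||(a>='A'&&a<='Z')||temp[0]==' ')
        {
            if((a>='a'&&a<='z')||(a>='A'&&a<='Z'))
            {
                if(i<29) // leave room for '\0'; letters beyond the 29th are dropped
                {
                    temp[i]=a;
                    i++;
                }
            }
            infile.get(a);
            if(!infile)break; // the read failed (end of file), so a is stale: stop here
        }
        temp[i]='\0';
        p=head->next;
        while(p) // look for the word in the list, ignoring case
        {
            if(!_stricmp(temp,p->name)) 
            { p->num++;break;}
            p=p->next;
        }
        if(!p&&temp[0]!='\0') // new, non-empty word: insert it at the front of the list
        {
                p=new Word;
                strcpy(p->name,temp);
                p->num=1;
                p->next=head->next;
                head->next=p;
        }
    }
    infile.close();
}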
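void outfile(struct Word*&head) // Write every word and its count to out.txt
{
ofstream outfile("out.txt");
struct Word *p;
p=head->next;
while(p)
    {
        outfile<<"Word: "<<p->name<<endl<<"Occurrences: "<<p->num<<endl;
        p=p->next;
    }

}
void out(struct Word*head,int n) // Print the first n entries of the list to the console
{
    struct Word *p;
    p=head->next;
    for(int i=0;i<n&&p;i++) // stop early if the list has fewer than n words
    {
        cout<<"Word: "<<p->name<<endl<<"Occurrences: "<<p->num<<endl;
        p=p->next;
    }
}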
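void getsort(struct Word*&head) // Insertion sort of the list by count, in descending order
{
    struct Word *p,*q,*s,*l;
    if(!head->next) return; // empty list: nothing to sort
    q=head;
    p=head->next;
    s=p->next; // s walks the still-unsorted remainder
    p->next=NULL; // the sorted part starts with just the first node
    while(s)
    {
        while(p&&p->num>s->num) // find the first node whose count is not larger than s's
        {
            q=p;
            p=p->next;
        }
        q->next=s; // splice s in between q and p
        l=s->next;
        s->next=p;
        s=l;
        p=head->next;
        q=head;
    }
}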
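int main()
{
struct Word *head;
head=new Word; // dummy head node; its name and num are never used
head->next=NULL;
readfile(head);
getsort(head);
out(head,10);
outfile(head);
while(head) // free the whole list, including the dummy head
{
struct Word *next=head->next;
delete head;
head=next;
}
return 0;
}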

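To recognize words accurately, I changed the original program's approach of reading words directly into a string: each word is now collected in a character array and then copied into the name field. That solved word recognition but wastes some space; I would appreciate suggestions from experts. Replacing the array with a linked list of the custom type also made sorting convenient and removed the space-allocation problem. However, I ran into problems with file reading, specifically with handling the end of the file, and would welcome advice on that.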
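One possible direction for both open questions (the wasted space and the end-of-file handling) is sketched below. It is not the method used in either program above: it swaps the linked list for `std::map` and the fixed `char[30]` buffer for `std::string`, which grows to fit each word. `countWords` and `topWords` are hypothetical names, and the sketch assumes ASCII English text. The key point for the end-of-file problem is that the loop condition is the read itself, so the loop body never sees a stale character.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream> // std::istringstream, handy for trying this without a file
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: count case-insensitive words (runs of ASCII letters).
std::map<std::string, int> countWords(std::istream& in)
{
    std::map<std::string, int> counts;
    std::string word;
    char c;
    // get() returns the stream, which tests false once a read fails, so the
    // body only runs for characters that were actually read.
    while (in.get(c)) {
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalpha(uc)) {
            word += static_cast<char>(std::tolower(uc));
        } else if (!word.empty()) {
            ++counts[word];
            word.clear();
        }
    }
    if (!word.empty()) {
        ++counts[word]; // last word when the input does not end with a separator
    }
    return counts;
}

// Hypothetical helper: the n most frequent words, most frequent first;
// words with equal counts are ordered alphabetically.
std::vector<std::pair<std::string, int>> topWords(
    const std::map<std::string, int>& counts, std::size_t n)
{
    std::vector<std::pair<std::string, int>> items(counts.begin(), counts.end());
    n = std::min(n, items.size());
    std::partial_sort(items.begin(), items.begin() + n, items.end(),
        [](const std::pair<std::string, int>& a,
           const std::pair<std::string, int>& b) {
            if (a.second != b.second) return a.second > b.second;
            return a.first < b.first;
        });
    items.resize(n);
    return items;
}
```

Under these assumptions, a `main` would open `in.txt` with `std::ifstream`, pass it to `countWords`, and print the result of `topWords(counts, 10)`. Any `std::istream` works, so a `std::istringstream` can stand in for the file while experimenting.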