Crawling Weibo Data with the Sina API in Python

2025-01-18 Update From: SLTechnology News&Howtos

I have long sold large amounts of Weibo data and travel-website review data, and I also offer custom data-crawling services. Contact: YuboonaZhang@Yahoo.com. You are also welcome to join the social media data exchange group: 99918768

Using the Sina API to crawl data (updated April 16, 2018)

2018.4.16 note

Note: today some people commented that my blog is rubbish and that there is something wrong with my code. This post is old; I wrote it when I was just starting out with crawlers. I am very grateful to people who comment on my code, but I cannot stand people who are rude and have a bad attitude. Speaking civilly is the least one can expect of an educated person in modern society.

I have changed the code. If you find any more problems, you are welcome to point them out politely!

At the same time, because Sina Weibo keeps changing its own API mechanism, the content of this post is now of limited use. For individual developers, the token you apply for only grants permission to crawl your own Weibo, so those who want to rely on the API to crawl data may not be able to achieve their goals. If you want to use the API to crawl Weibo content, your only option is to obtain higher developer permissions.

1. First, let's look at the end result, so you can decide whether it is what you want and whether to read on.

I crawled for about 4 days and collected about 3.6 million records. Because I ran the crawler on my own computer, it was sometimes interrupted when the network dropped at night. Without interruptions it collects about 1 million of the latest Weibo posts a day (I call the public_timeline API, which returns the latest public posts).

The API documentation defines many return fields (the data is returned in JSON format). I picked out the information I considered important, as shown in the figure: id, location, number of followers, post content, post time, and so on. Of course, you can customize which fields to keep according to your own needs.

That is roughly what you get. If you think it will help you, please read on. This is my first blog post, so it is a bit verbose.

2. Preparation
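As a minimal sketch of "picking out the fields you care about", the function below flattens one status object into a smaller record. The field names are the ones used by the crawler code later in this post; the helper name flatten_status and the sample data are hypothetical and only for illustration.

```python
def flatten_status(status):
    """Pick the fields of interest out of one status object.

    Missing fields become None instead of raising KeyError.
    """
    user = status.get('user') or {}
    return {
        'created_at': status.get('created_at'),
        'text': status.get('text'),
        'geo': status.get('geo'),
        'comments_count': status.get('comments_count'),
        'reposts_count': status.get('reposts_count'),
        'id': user.get('id'),
        'nickname': user.get('screen_name'),
        'location': user.get('location'),
        'followers_count': user.get('followers_count'),
    }


# Hypothetical sample, shaped like one element of the 'statuses' array.
sample = {
    'created_at': 'Mon Apr 16 10:00:00 +0800 2018',
    'text': 'hello',
    'user': {'id': 1, 'screen_name': 'someone', 'followers_count': 42},
}
record = flatten_status(sample)
print(record['nickname'], record['followers_count'])
```

Using .get() rather than direct indexing is a design choice for a long-running crawler: one post with an unexpected shape does not abort the whole batch.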

What we need:

Database: MongoDB (the MongoBooster client can be used)

Development environment: Python 2.7 (the IDE I use is PyCharm)

A Sina developer account: just register with your own Sina Weibo account (covered later)

Required libraries: requests and pymongo (these can be installed from PyCharm)

2.1 Installation of MongoDB

MongoDB is a high-performance, open source, schema-less document database and one of the most popular NoSQL databases. In many scenarios it can replace a traditional relational database or a key/value store. MongoDB is written in C++. Readers can find more detailed information on the official MongoDB website.

Aside: what is NoSQL?

NoSQL, short for Not Only SQL, refers to non-relational databases. These next-generation databases focus on a few key points: non-relational, distributed, open source, and horizontally scalable. Originally intended for large-scale web applications, the movement began in early 2009. Typical features include: schema-free design, easy replication support, a simple API, eventual consistency (non-ACID), and large data capacity. The kind of NoSQL we use most is key-value storage; there are also document stores, column stores, graph databases, XML databases, and so on.

There are many tutorials for installing mongodb on the Internet, so I won't write them.

Installation of mongodb under Windows

Installation of mongodb under Linux

2.2 How to register a Sina developer account with a Sina Weibo account (163 email address, mobile number)

After creation, you need to fill in the mobile phone number for verification.

Enter the Sina Open platform:

Click to continue to create

You need to enter the following information to create an application for the first time:

The information on this page does not need to be real. Fields such as region and telephone number can be filled in arbitrarily, and any website will do. (The email address should be real.)

Continue creating the application. The application name is up to you. For the platform, check iOS and Android.

After the creation is completed, you can go straight back and continue creating. A single account can create 10 applications, each with its own access token (in practice, one was enough for my needs).

Go to the API test platform

Select the application you created, then click to get the token shown below and save it in a txt file.

Get key

Back to

Click my App

Then select the application you just created

Click the application information after entering

Save APP Key and APP Secret

Click Advanced Information

Set callback URL

It can be set to the default:

https://api.weibo.com/oauth2/default.html

At this point, your developer account is complete.

2.3 installation method of dependent libraries

Installation of requests and pymongo

They can be installed directly with pip:

pip install requests

pip install pymongo

It can also be installed directly in Pycharm.

Select File-> Settings-> Project-> Project Interpreter

You can see the Python libraries you have installed; click the green + sign on the right

Just install it.

3. Analyze the problem

3.1 OAuth authentication

Description of authorization mechanism (very important)

Many people online use the Sina Weibo API to post to Weibo and so on. They all use the approach of asking the user to authorize a token, but that approach is obviously not suitable for crawling data, because we would have to make the request every time and obtain the code again every time. For more information, see the authorization mechanism of the Sina Weibo API.

Teacher Liao Xuefeng (contributor to sinaweibopy) also has an explanation of this authorization mechanism.

When a site is accessed through Sina Weibo's API, users do not need to register on your site; they can log in to your site directly with their Sina Weibo account and password. This requires that your site can confirm the user has logged in without ever knowing the user's password. Since the password is stored at Sina Weibo, only Sina Weibo can authenticate the user. But how does Sina Weibo communicate with your website and tell it whether the user has logged in successfully? This process is called third-party login. OAuth is a standard third-party login protocol. With OAuth, your website can securely obtain the identity of users who have logged in successfully at Sina Weibo.

OAuth currently has two main versions, 1.0 and 2.0. Version 2.0 greatly simplifies version 1.0, and its API is simpler. Sina Weibo's latest API also uses OAuth 2.0. The whole login process is as follows:

Step 1: The user clicks "sign in using Sina Weibo" on your website, and your website redirects the user to the OAuth authentication page of Sina Weibo. The redirect link contains the client_id parameter as your website's ID, and the redirect_uri parameter to tell Sina Weibo where to redirect the browser on your website once the user has logged in successfully.

Step 2: The user enters the account and password on the authentication page of Sina Weibo.

Step 3: After Sina Weibo authenticates the user, it redirects the browser to your website with the code parameter attached.

Step 4: Your website uses the code parameter to request the user's access token from Sina Weibo.

Step 5: After your website gets the user's access token, the login is complete.

An OAuth access token is a token generated by the site that provides the authentication service (here, Sina Weibo) and represents a user's authentication information. In subsequent API calls, passing in the access token identifies the logged-in user. In this way, through the OAuth protocol, your website hands the job of verifying the user over to Sina Weibo, and Sina Weibo tells you whether the user has logged in successfully.

The security of OAuth rests on step 4. The exchange of the code parameter for an access token is made by your website's backend directly to Sina Weibo, so users never see the HTTP request that obtains the access token. If the user passes in a fake code, Sina Weibo returns an error.
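The two requests your website makes in this flow can be sketched as follows. This is only an illustration under stated assumptions: it is written for Python 3 (the post itself uses Python 2.7, where urlencode lives in the urllib module), the endpoint URLs are the standard Weibo OAuth 2.0 ones as I understand them, and the helper names and placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

# Assumed Weibo OAuth 2.0 endpoints; check them against the official docs.
AUTHORIZE_URL = 'https://api.weibo.com/oauth2/authorize'
ACCESS_TOKEN_URL = 'https://api.weibo.com/oauth2/access_token'


def build_authorize_url(app_key, redirect_uri):
    """Step 1: the link your site redirects the user's browser to."""
    query = urlencode({
        'client_id': app_key,
        'redirect_uri': redirect_uri,
        'response_type': 'code',
    })
    return AUTHORIZE_URL + '?' + query


def build_token_request(app_key, app_secret, redirect_uri, code):
    """Step 4: form fields your backend POSTs to exchange the code."""
    return {
        'client_id': app_key,
        'client_secret': app_secret,
        'grant_type': 'authorization_code',
        'code': code,
        'redirect_uri': redirect_uri,
    }


# Placeholder values; substitute your own App Key and callback URL.
url = build_authorize_url('YOUR_APP_KEY',
                          'https://api.weibo.com/oauth2/default.html')
print(url)
```

With the requests library, the exchange in step 4 would then be a POST of those form fields to ACCESS_TOKEN_URL, made from the backend so that the App Secret never reaches the browser.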

For details, please see teacher Liao Xuefeng's document.

Roughly speaking, an ordinary call that asks the user to authorize a token goes like this:

Get code

After logging in, the browser is redirected to a link of the form https://api.weibo.com/oauth2/default.html?code=×××

All we need is the value of code=.
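Pulling that value out of the redirected address is simple with the standard library. A small sketch, assuming Python 3 (in Python 2.7 the same functions live in the urlparse module); the function name extract_code and the sample code value are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_code(redirected_url):
    """Return the value of the code= query parameter, or None if absent."""
    query = parse_qs(urlparse(redirected_url).query)
    values = query.get('code')
    return values[0] if values else None


# 'abc123' stands in for a real authorization code.
print(extract_code('https://api.weibo.com/oauth2/default.html?code=abc123'))
# prints abc123
```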

In other words, every time you go through API authentication, a code shows up in the browser, which is obviously no good for crawling.

How can we solve this? The first idea is to simulate the Sina Weibo login in the Python program, after which we could naturally get the value of code. But simulating the Sina Weibo login is fairly complex, and if the simulated login already works, why call the API at all? It would be more convenient to write a custom crawler directly.

If you read the authorization mechanism above, you have probably guessed the answer: this is where the access token we applied for earlier comes in.

As far as I understand it, an access token authorizes a third party to act on your Weibo account on your behalf, much like logging in to a mobile app through Sina Weibo and then using it. To quote the authorization documentation above: a mobile application can use the official mobile SDK directly and authorize by calling up the Weibo client (the H5 authorization page is shown if the Weibo client is not installed).

You should be familiar with this interface.

Sina also provides an explanation of OAuth2/access_token.

4. Code implementation

With token, it is very easy to grab data.

How much data you can crawl depends on your token permissions.

The next step is to use API to get the data: create a new file weibo_run.py

# -*- coding: utf-8 -*-
import requests
from pymongo import MongoClient

ACCESS_TOKEN = '2.00ZooSqFHAgn3D59864ee3170DLjNj'
URL = 'https://api.weibo.com/2/statuses/public_timeline.json'


def run():
    # Authorization
    while True:
        # Call the statuses/public_timeline API
        params = {'access_token': ACCESS_TOKEN}
        statuses = requests.get(url=URL, params=params).json()['statuses']
        length = len(statuses)
        # Printed so I can check how many posts each call returned
        print(length)

        # Connect to MongoDB; a local instance needs no extra configuration
        Monclient = MongoClient('localhost', 27017)
        db = Monclient['Weibo']
        WeiboData = db['HadSelected']

        # The names make it clear which field each value comes from
        for i in range(0, length):
            created_at = statuses[i]['created_at']
            id = statuses[i]['user']['id']
            province = statuses[i]['user']['province']
            city = statuses[i]['user']['city']
            followers_count = statuses[i]['user']['followers_count']
            friends_count = statuses[i]['user']['friends_count']
            statuses_count = statuses[i]['user']['statuses_count']
            url = statuses[i]['user']['url']
            geo = statuses[i]['geo']
            comments_count = statuses[i]['comments_count']
            reposts_count = statuses[i]['reposts_count']
            nickname = statuses[i]['user']['screen_name']
            desc = statuses[i]['user']['description']
            location = statuses[i]['user']['location']
            text = statuses[i]['text']

            # Insert into MongoDB
            WeiboData.insert_one({
                'created_at': created_at,
                'id': id,
                'nickname': nickname,
                'text': text,
                'province': province,
                'location': location,
                'description': desc,
                'city': city,
                'followers_count': followers_count,
                'friends_count': friends_count,
                'statuses_count': statuses_count,
                'url': url,
                'geo': geo,
                'comments_count': comments_count,
                'reposts_count': reposts_count,
            })


if __name__ == "__main__":
    run()

At first my code looked like this, and it seemed to be done.

However, Sina limits the number of calls you can make. When I tried rerunning the script, I noticed something: the lengths printed on each pass were different, always hovering between 16 and 20, which showed that each call returned different data. So I decided to simply write an endless loop and see when I would get blocked. The code became this.

Delete the call to run() and replace it with the following endless loop.

if __name__ == "__main__":
    while 1:
        try:
            run()
        except Exception:
            # Ignore the error and start again
            pass

As a result, it kept running. It has been running for four days without being blocked, so I expect it will not be blocked.
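A slightly more careful version of such an endless loop might pause between calls and report failures instead of silently discarding them. The sketch below is not the author's code: run_forever, its parameters, and the flaky stand-in function are hypothetical, and it assumes the body of one API call has been factored into a function that performs a single request.

```python
import time


def run_forever(fetch_once, pause=1.0, max_rounds=None, sleep=time.sleep):
    """Call fetch_once repeatedly, surviving individual failures.

    max_rounds=None loops forever; a number is handy for testing.
    Returns the number of rounds that raised an exception.
    """
    failures = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        try:
            fetch_once()
        except Exception as exc:
            failures += 1
            print('round %d failed: %r' % (rounds, exc))
        sleep(pause)  # be gentle with the API between calls
    return failures


# Stand-in for one API call: fails on every second invocation.
calls = []


def flaky():
    calls.append(1)
    if len(calls) % 2 == 0:
        raise ValueError('simulated network error')


failed = run_forever(flaky, pause=0, max_rounds=4, sleep=lambda s: None)
print(failed)
```

Catching Exception rather than using a bare except keeps Ctrl-C working, which matters for a script that is meant to run for days.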

Other APIs work the same way: you only need to change the url and params. For more information, see the Sina Weibo API documentation.
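To illustrate "only change the url and params", here is a small sketch that builds the request address for an arbitrary endpoint. It assumes Python 3; build_api_url is a hypothetical helper, 'YOUR_TOKEN' is a placeholder, and the count parameter is my assumption about what public_timeline accepts, so check it against the API documentation.

```python
from urllib.parse import urlencode

API_ROOT = 'https://api.weibo.com/2/'


def build_api_url(endpoint, access_token, **params):
    """Build the GET address for one API endpoint.

    endpoint is the path without '.json', e.g. 'statuses/public_timeline'.
    """
    query = {'access_token': access_token}
    query.update(params)
    # Sort the parameters so the resulting URL is deterministic.
    return API_ROOT + endpoint + '.json?' + urlencode(sorted(query.items()))


print(build_api_url('statuses/public_timeline', 'YOUR_TOKEN', count=50))
# prints https://api.weibo.com/2/statuses/public_timeline.json?access_token=YOUR_TOKEN&count=50
```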

At first I found that I could get 8 million records a day, and I was delighted. Later I discovered that a lot of the data was duplicated. After half a day of searching I found a solution: build a unique index in MongoDB on the user's id and the creation time (because one person cannot send two Weibo posts at the same moment). In the end, without duplicate data, you can get about 1 million records a day.
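One way this deduplication could look with pymongo is sketched below. It is an illustration, not the author's code: the helper names are hypothetical, and the functions accept any collection object that offers pymongo's create_index and update_one methods. The field names id and created_at match the record keys used in weibo_run.py.

```python
def ensure_unique_index(collection):
    """Create a unique compound index on (id, created_at).

    1 means ascending order, the same value as pymongo.ASCENDING.
    """
    return collection.create_index([('id', 1), ('created_at', 1)],
                                   unique=True)


def insert_if_new(collection, record):
    """Insert record unless one with the same id and created_at exists.

    Returns True if a new document was inserted, False if it was a duplicate.
    """
    key = {'id': record['id'], 'created_at': record['created_at']}
    result = collection.update_one(key, {'$setOnInsert': record}, upsert=True)
    return result.upserted_id is not None


# Usage against a real server (assumes MongoDB on localhost:27017):
#   from pymongo import MongoClient
#   collection = MongoClient('localhost', 27017)['Weibo']['HadSelected']
#   ensure_unique_index(collection)
#   insert_if_new(collection, record)
```

Note that MongoDB refuses to build a unique index on a collection that already contains duplicates for those keys, so the index is best created before crawling starts or after the existing duplicates are cleaned up.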

Personal blog

8aoy1.cn
