
How to use Python to read web pages

Updated 2025-04-06. Source: SLTechnology News&Howtos (shulou.com), Development section.


This article shows how to use Python to read a web page aloud. The process has three steps: extract the main text of the page, convert the text to speech, and play the audio. The approach is quite practical, so it is shared here for reference.

1 Web page text extraction

Python is a good fit here because its rich library ecosystem makes it easy to extract the main text of a web page. I tried two libraries: readability and goose3.

1.1 readability

readability supports Python 3 and can be installed with pip install readability-lxml.

It is easy to use:

    import requests
    from readability import Document

    response = requests.get('http://news.china.com/socialgd/10000169/20180616/32537640_all.html')
    doc = Document(response.text)
    print(doc.title())

However, the body that readability extracts (doc.summary()) is not plain text; it still contains HTML tags.

You can combine readability with another component, such as html2text, to strip the HTML. That is not covered in detail here; try it yourself if you are interested.
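As a rough illustration of that post-processing step, the sketch below strips tags using only the standard library's html.parser instead of html2text, so it runs without extra installs. The names `_TextCollector` and `html_to_text` are made up for this example, and the treatment of block-level tags is a simplification rather than a full HTML-to-text conversion.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    _SKIP = {"script", "style"}
    _BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1
        elif tag in self._BLOCK:
            self._parts.append("\n")  # block tags start a new line

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self._BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Trim each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Assuming doc is the Document object from the readability example, html_to_text(doc.summary()) would then yield plain text suitable for speech synthesis.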

1.2 goose3

Goose was originally an article extractor written in Java; goose3 is its Python implementation.

It is also convenient to use and handles Chinese well. You can install it with pip install goose3.

    >>> from goose3 import Goose
    >>> from goose3.text import StopWordsChinese
    >>> url = 'http://news.china.com/socialgd/10000169/20180616/32537640_all.html'
    >>> g = Goose({'stopwords_class': StopWordsChinese})
    >>> article = g.extract(url=url)
    >>> print(article.cleaned_text[:150])

At 23:00 Beijing time (18:00 local time), a 2018 World Cup Group B match was played at the St. Petersburg Stadium, and Iran narrowly beat Morocco 1-0. Iranian striker Azmoun missed a one-on-one chance before half-time. Bouhaddouz scored an own goal in the 95th minute. This is Iran's first win at the World Cup finals in 20 years.

In this World Cup, both substitutes and goals were scored one after another to make up for scoring twice and the host.

As you can see, the extraction of the page body works well and basically meets our requirements, so goose3 is usable.

Note: there is also a Python 2 version of goose, Python-Goose, which is used in much the same way as goose3.

2 Text to speech

Baidu, Ali, Tencent, iFLYTEK and others all provide REST APIs for text to speech. Applying for access to Ali and Tencent takes relatively long, and Ali appears to charge a fee. Baidu and iFLYTEK can be used right after an online application.

Nothing can be done about that; good things rarely come without a few twists. Of these providers, Baidu has no practical limit on the number of calls (in fact, the default quota is 200,000 calls per day), while iFLYTEK limits calls to 500 per day.

Here we use the speech synthesis interface of Baidu's REST API. One reason is that Baidu does not limit the number of calls; another is that, from a quick read of the iFLYTEK interface documentation, it still has many restrictions. In addition, Baidu provides a Python wrapper for its REST API, which makes it more convenient to use.

2.1 Using baidu-aip

Baidu provides a Python SDK, which can be installed directly with pip install baidu-aip. For details of the API, refer to the API documentation: http://ai.baidu.com/docs#/TTS-Online-Python-SDK/top.

An example of its use:

    from aip import AipSpeech

    # Your APP ID, API Key (AK) and Secret Key (SK) can be viewed in the
    # application list of the service console.
    APP_ID = 'your App ID'
    API_KEY = 'your Api Key'
    SECRET_KEY = 'your Secret Key'

    client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)

    result = client.synthesis('Hello, what are you doing', 'zh', 3, {
        'vol': 5,
    })

    # On success the call returns the audio as binary data; on failure it
    # returns a dict with an error code (see the API documentation).
    if not isinstance(result, dict):
        with open('audio.mp3', 'wb') as f:
            f.write(result)

API parameters:

Parameter | Type | Description | Required
--------- | ---- | ----------- | --------
tex | String | Text to synthesize, UTF-8 encoded. The text length must be less than 1024 bytes. | Yes
lang | String | Language selection; fill in zh. | Yes
ctp | String | Client type selection; fill in 1 for web clients. | Yes
cuid | String | Unique user identifier, used to distinguish users. Fill in the machine MAC address or IMEI code; length within 60. | No
spd | String | Speed, value 0-9. Default is 5 (medium speed). | No
pit | String | Pitch, value 0-9. Default is 5 (medium pitch). | No
vol | String | Volume, value 0-15. Default is 5 (medium volume). | No
per | String | Voice selection: 0 is a female voice, 1 is a male voice, 3 is emotional synthesis (Du Xiaoyao), 4 is emotional synthesis (Du Yaya). Default is the ordinary female voice. | No
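The value ranges in the parameter description above can be checked locally before a request is sent. The helper below is a hypothetical convenience and not part of baidu-aip; the ranges are copied from the description rather than queried from the API, and reading "length within 60" as "at most 60 characters" is an assumption.

```python
# Allowed values for the optional synthesis parameters, copied from the
# parameter description above (not fetched from the API itself).
OPTION_RANGES = {
    "spd": range(0, 10),   # speed, 0-9
    "pit": range(0, 10),   # pitch, 0-9
    "vol": range(0, 16),   # volume, 0-15
    "per": (0, 1, 3, 4),   # voice selection
}


def check_options(options):
    """Return the options unchanged, or raise ValueError on a bad value."""
    for name, value in options.items():
        if name == "cuid":
            # Assumption: the length limit counts characters.
            if len(str(value)) > 60:
                raise ValueError("cuid must be at most 60 characters")
            continue
        if name not in OPTION_RANGES:
            raise ValueError(f"unknown option: {name}")
        if value not in OPTION_RANGES[name]:
            raise ValueError(f"{name}={value!r} is out of range")
    return options
```

For example, check_options({'vol': 5, 'per': 3}) passes the dict through, while check_options({'vol': 16}) raises ValueError before any network call is made.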

The API limits how much text can be passed in one call: the text to synthesize must be less than 1024 bytes long. Longer text has to be split and converted into audio files through multiple requests, and the resulting audio files are then merged into one.

2.2 Splitting the text

The following code splits the text into a list of chunks of at most 500 characters each:

    # Split the text into chunks of 500 characters
    text_list = [text[i:i + 500] for i in range(0, len(text), 500)]
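The limit quoted earlier is expressed in bytes, while the slice above counts characters. For multi-byte text the two differ, and how the service counts bytes is not stated in this article. As a sketch under the assumption that the budget applies to the UTF-8 encoding, the hypothetical helper below cuts on a byte budget without splitting a character; the default of 1000 bytes is an arbitrary margin below the limit.

```python
def split_by_bytes(text, max_bytes=1000, encoding="utf-8"):
    """Split text into chunks whose encoded size is at most max_bytes."""
    chunks = []
    current = []  # characters of the chunk being built
    size = 0      # encoded size of the current chunk in bytes
    for ch in text:
        ch_size = len(ch.encode(encoding))
        if ch_size > max_bytes:
            raise ValueError("max_bytes is smaller than a single character")
        if current and size + ch_size > max_bytes:
            # Adding this character would exceed the budget: close the chunk.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_size
    if current:
        chunks.append("".join(current))
    return chunks
```

This version cuts wherever the budget runs out, just like the character slice; cutting at sentence boundaries would likely sound better when the pieces are joined, but is left out to keep the sketch short.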

2.3 Merging the audio files

We use pydub to process the generated audio files. You can install it with pip install pydub.

pydub also needs an external dependency. On Ubuntu it can be installed with sudo apt-get install libav-tools; on Windows, download FFmpeg from https://ffmpeg.zeranoe.com/builds/ and add it to the PATH environment variable.

If you run into other problems, refer to the setup instructions on the official project page: https://github.com/jiaaro/pydub.

    import os
    import uuid

    from pydub import AudioSegment

    # Merge audio files
    def merge_voice(file_list):
        song = None
        for i, f in enumerate(file_list):
            if i == 0:
                song = AudioSegment.from_file(f, "mp3")
            else:
                # Append the next audio file
                song += AudioSegment.from_file(f, "mp3")
            # Delete the temporary audio file
            os.unlink(f)
        # Export the merged audio file in MP3 format
        file_name = str(uuid.uuid1()) + ".mp3"
        song.export(file_name, format="mp3")
        return file_name
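Between splitting the text and merging the files, each chunk has to be sent to the synthesis interface and saved to disk. The sketch below shows one way that step might look. `synthesize_to_files` is a hypothetical helper; it assumes only a client object with the synthesis(text, lang, ctp, options) call used earlier, which returns audio bytes on success and a dict on error. The stub client exists only so the sketch can run without credentials.

```python
import os
import tempfile


def synthesize_to_files(client, chunks, ctp=1, options=None, out_dir=None):
    """Synthesize each chunk and return the list of MP3 file paths."""
    out_dir = out_dir or tempfile.mkdtemp(prefix="tts_")
    paths = []
    for index, chunk in enumerate(chunks):
        # Same call shape as the AipSpeech example; ctp=1 is the web
        # client value given in the parameter description.
        result = client.synthesis(chunk, "zh", ctp, options or {"vol": 5})
        if isinstance(result, dict):
            # Failures come back as a dict carrying an error code.
            raise RuntimeError(f"synthesis failed for chunk {index}: {result}")
        path = os.path.join(out_dir, f"part_{index:04d}.mp3")
        with open(path, "wb") as f:
            f.write(result)
        paths.append(path)
    return paths


class FakeClient:
    """Stand-in for AipSpeech, used only to exercise the sketch offline."""

    def synthesis(self, text, lang, ctp, options):
        return text.encode("utf-8")  # pretend these bytes are audio
```

With a real AipSpeech client, the returned list could be passed straight to merge_voice, which also deletes the temporary files.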


With Baidu's interface we can convert text into audio files. The next question is how to play them.

3 Audio file playback

A search online turns up several ways to play wav files from Python, including pyaudio, pygame, winsound and playsound. In my tests, however, only playsound worked. If you are interested in the other options, try them yourself, and leave a message if you have questions.

Install it with pip install playsound.

It is also easy to use:

    >>> from playsound import playsound
    >>> playsound('/path/to/a/sound/file/you/want/to/play.mp3')

Note: audio playback has to be run in a graphical desktop session, because in pure command-line mode there is no output device for the sound.

With the steps combined into a script, page2voice.py, it is run like this:

    python page2voice.py -u "https://so.gushiwen.org/shiwenv_c244fc77f6fb.aspx"

When run, the script automatically parses the page and reads it aloud.
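The script itself is not listed in this article. As a hedged sketch of how such an entry point might be wired, the code below parses the -u option seen in the command and runs three injected steps in order. The long option --url, the function names, and the step signatures are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; -u is the option shown in the article."""
    parser = argparse.ArgumentParser(description="Read a web page aloud.")
    parser.add_argument("-u", "--url", required=True, help="page to read")
    return parser.parse_args(argv)


def run(url, extract, to_audio, play):
    """Chain the three stages: page text -> audio file -> playback."""
    text = extract(url)
    if not text:
        raise ValueError(f"no text extracted from {url}")
    audio_path = to_audio(text)
    play(audio_path)
    return audio_path
```

In a real script, extract could wrap the goose3 call, to_audio could split the text, synthesize each chunk and merge the files, and play could be playsound; run(parse_args().url, extract, to_audio, play) would then tie the command line to the pipeline.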
