
The World Cup is coming. Write a crawler to get player data.


Foreword: The World Cup has begun, and everyone's enthusiasm for watching matches has been rekindled. Game production often requires building character data, especially for sports games. Setting the numbers yourself is both heavy work and too subjective, so at times like this you need to consult authoritative websites for reference data. Drawing on my own hands-on experience, I'll show you how to build a simple crawler.

Preparation: First, be clear about what we need to crawl: the player data of FIFA 23. The website https://sofifa.com/ carries all the player data from FIFA 07 through FIFA 23, in great detail. Opening the home page, it looks like this:

We click on one of the players and analyze:

The data we need is all in the areas shown in the two screenshots above. Below them are the transfer records and user comments, which we don't need right now.

Analysis shows that each player has a unique id, displayed in the URL. Whether you arrive by searching a name or by jumping from a team page, the player's page shows this id.

The last string of numbers, 230006, appears to be some kind of parameter; remove it from the url and the player page still opens.

After removing the player's name, you can still open the page.

So we understand: every player is identified by nothing more than a number.
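In other words, both of these URL forms (the placeholders are illustrative, not real values) open the same player page:

https://sofifa.com/player/{id}/{player-name}/{roster}/
https://sofifa.com/player/{id}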

At this point the important preparatory work is done. We have found the pattern; next we apply it.

Environment: I used Python version 3.9.12. Then install the requests and beautifulsoup4 libraries, either with pip install requests and pip install beautifulsoup4, or through conda's package management; how to install them is easy to look up, so I won't repeat it.
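Since every snippet below relies on the same handful of modules, here they are collected once so each fragment runs as-is (note that the 'lxml' parser used with BeautifulSoup also needs pip install lxml):

import re
import csv
import requests
from bs4 import BeautifulSoup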

First write the main function used to get player data:

# Crawl a player's data by id and return it as a list
def fetchData(id):
    url = f'https://sofifa.com/player/{str(id)}'
    myRequest = requests.get(url)
    soup = BeautifulSoup(myRequest.text, 'lxml')
    myList = []
    return myList

We fetch one player's information through their id, so the parameter is id; iterate over ids and you can crawl every player's information. If no such player exists, a null value is returned. Note that a status code of 200 on the request means the connection succeeded. Retries and setting HTTP headers you can look up yourself.
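For reference, a minimal sketch of both (the User-Agent string, timeout, and retry count here are arbitrary choices of mine, not from the original script):

def fetchWithRetry(url, tries=3):
    # a browser-like User-Agent, since some sites reject the default python-requests one
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(tries):
        r = requests.get(url, headers=headers, timeout=10)
        if r.status_code == 200:  # 200 means the connection succeeded
            return r
    return None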

Selecting values on the page: press F12 to inspect the page elements and locate the values you want. Every project needs different things; what follows is what we need.

Meta data: the page carries a block of meta data that is not displayed, recording a description of the player. I grab it for quick disambiguation of players with the same name.

Filter by year: since I want the latest FIFA 23 data, I check the year shown in the upper-left corner. If it isn't FIFA 23, return a null value.

The code up to this point:

def fetchData(id):
    url = f'https://sofifa.com/player/{str(id)}'
    myRequest = requests.get(url)
    soup = BeautifulSoup(myRequest.text, 'lxml')
    meta = soup.find(attrs={'name': 'description'})['content']
    years = soup.find(name='span', attrs={'class': 'bp3-button-text'})
    if meta[:4] != 'FIFA' and str(years.string) != "FIFA 23" or meta[:4] == 'FIFA':
        # print(years.string + ' has no FIFA 23 data')
        return None
    info = soup.find(name='div', attrs={'class': 'info'})
    playerName = info.h1.string
    myList = [id, playerName]

Basic information: position, birthday, height, weight and so on. As we can see, it all arrives as one string.

Here is the brain-saving approach used throughout this article: copying selectors. Right-click the part you need to crawl, choose Copy selector, and it lands on your clipboard.

# Get the small-print information
rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > div > div")

FYI: you can also select it with XPath, though that means learning a little XPath syntax. Chrome has an XPath Helper plug-in that makes it easy to test whether your XPath is written correctly.
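For reference, a minimal sketch of the XPath route just mentioned, using lxml directly; the expression is illustrative, not copied from the real page:

from lxml import etree

tree = etree.HTML(myRequest.text)
# illustrative XPath: collect the text of every span inside the player card
spans = tree.xpath('//div[contains(@class, "player")]//span/text()')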

Because players may have multiple positions (the most I have seen is four), the code below computes an offset to make sure the string is sliced at the right place.

# If there is more than one position, shift accordingly, otherwise the sliced string will be wrong
offset = rawdata[0].find_all("span")
offset = len(offset) - 1
temp = rawdata[0].text
temp = re.split(r'\s', temp)
if offset > 0:
    for i in range(offset):
        temp.pop(i)

Birthday information and conversion: get the birthday and convert it to the format we need. One thing worth mentioning: a "day/month/year" string opened in Excel gets automatically converted to a date format, which is a nuisance. My workaround is to open the file in WPS, or in Feishu, and paste it back. If you have a better way, please leave a comment.

Height and weight come next; they are simple string slices.

# Get the player's birthday and convert it to the desired format (DAY/MONTH/YEAR)
month = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# birthday = temp[3][1:] + '-' + temp[4][:-1] + '-' + temp[5][:-1]
mon = temp[3][1:]
mon = month.index(mon) + 1
day = temp[4][:-1]
year = temp[5][:-1]
birthday = f"{str(year)}/{str(mon)}/{str(day)}"
myList.append(birthday)
# Height and weight
height = int(temp[6][:-2])
myList.append(height)
weight = int(temp[7][:-2])
myList.append(weight)
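As an aside, the month-list lookup above can also be done with datetime parsing; a sketch, assuming the scraped tokens reassemble into a string like "Jun 24, 1987":

from datetime import datetime

raw = f"{temp[3][1:]} {temp[4][:-1]}, {temp[5][:-1]}"  # e.g. "Jun 24, 1987"
birthday = datetime.strptime(raw, "%b %d, %Y").strftime("%Y/%m/%d")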

Get the Profile: we need the Profile information on the left side of the page, including preferred foot, skill moves, international reputation, and attacking and defensive work rates. The left foot is encoded as 1 and the right foot as 2. Magic numbers like this exist in plenty of projects; one can only smile :)
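If you would rather not live with the magic number, a small sketch using IntEnum (my addition, not part of the original script) keeps the values 1 and 2 but gives them names:

from enum import IntEnum

class Foot(IntEnum):
    LEFT = 1
    RIGHT = 2

# Foot.LEFT compares equal to 1 and writes out as 1 in the csv
scraped_text = 'Left'  # stand-in for the text scraped from the page
foot = Foot.LEFT if scraped_text == 'Left' else Foot.RIGHT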

# Get the profile (preferred foot, skill moves, international reputation, etc.)
rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div:nth-child(2) > div > ul")
temp = rawdata[0].find_all('li', class_="ellipsis")
preferred_foot = temp[0].contents[1]
preferred_foot = 1 if (preferred_foot == 'Left') else 2
myList.append(preferred_foot)
skill_move_level = temp[2].contents[0]
myList.append(int(skill_move_level))
reputation = temp[3].contents[0]
myList.append(int(reputation))
todostr = temp[4].text
workrateString = re.split(r'\s', todostr)
wr_att = workrateString[1][4:-1]
wr_def = workrateString[2]
wrList = ['Low', "Medium", "High"]
wr_att = wrList.index(wr_att) + 1
wr_def = wrList.index(wr_def) + 1
myList.append(wr_att)
myList.append(wr_def)

Most of this code is spent splitting and concatenating strings.

Get the avatar: images of all kinds are handled the same way; this could be called the most useful part of a crawler (just kidding).

For the avatar we need the url of the img, then download it as a stream. The most important thing here is how you name the images; don't end up with a pile of downloads you can't tell apart. (An aside: this passage was typed in a mix of Chinese and English; writing img as "picture", url as "address", and stream as "stream" just feels awkward. Pointers on how to write a culturally confident code tutorial in pure Chinese are welcome.)

# Avatar
rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > img")
img_url = rawdata[0].get("data-src")
img_r = requests.get(img_url, stream=True)
# print(img_r.status_code)
img_name = f"{id}_{playerName}.png"
with open(f"XVX/your own path goes here, everyone's is different/{img_name}", "wb") as fi:
    for chunk in img_r.iter_content(chunk_size=120):
        fi.write(chunk)

An aside: many images downloaded from web pages turn out to be in WebP format, a format Google created. You can install the "Save Image as Type" plug-in and right-click to save them as PNG or JPG.
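If you would rather convert WebP programmatically than through a browser plug-in, a minimal sketch with Pillow (pip install Pillow; the file names are placeholders):

from PIL import Image

img = Image.open("avatar.webp")  # Pillow reads WebP when built with WebP support
img.save("avatar.png")           # the output format is inferred from the extension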

Other information: the position, club, and nationality information all use the same method: for whatever you need, right-click it and copy the selector.

# Get positions
rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > div > div > span")
allPos = ' '.join(f"{p.text}" for p in rawdata)
myList.append(allPos)
rawdata = soup.select("#body > div:nth-child(6) > div > div.col.col-4 > ul > li:nth-child(1) > span")
bestPos = rawdata[0].text
myList.append(bestPos)
# Get club
rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div:nth-child(4) > div > h5 > a")
club = rawdata[0].text if len(rawdata) > 0 else "no club"
myList.append(club)
# Get nationality
rawdata = soup.select("#body > div:nth-child(5) > div > div.col.col-12 > div.bp3-card.player > div > div > a")
nation = rawdata[0].get("title") if len(rawdata) > 0 else "no country"
myList.append(nation)

Now for the attributes: these seventy or eighty attributes are the most troublesome to copy by hand, and they are the reason this crawler was written.

Analysis shows that each attribute's value is also written into its class name, for example class="bp3-tag p p-73"; the common part is "bp3-tag p", so a regular expression is needed (re is Python's regular-expression module; if that means nothing to you, look it up, I won't go on about it).

Paste it in, return the attributes as a list at the end, and the crawler's main function is complete.

# Get the attributes
rawdata = soup.select('#body > div:nth-child(6) > div > div.col.col-12')
data = rawdata[0].find_all(class_=re.compile('bp3-tag p'))
# print(data)
myList.extend(allatt.text for allatt in data)
return myList

Writing to a file: finish the write function before starting the next step. Otherwise the hard-won data lives only in memory and is easily lost. Many non-programmers may not know that this process is called "persistence". As the saying goes, a hero is judged not by length but by persistence; the same is true of code.

I recommend writing csv, though other formats work too. If you want to write excel, the openpyxl library is recommended. In the code below, the longest part is the table header.

# Write to file
def dealWithData(dataToWrite):
    header_list = ['id', 'name', 'birthday', 'height', 'weight', 'preferred_foot', 'skill_move_level',
                   'reputation', 'wr_att', 'wr_def', 'Positions', 'Best Position', 'Club', 'nation',
                   'Crossing', 'Finishing', 'Heading Accuracy', 'Short Passing', 'Volleys', 'Dribbling',
                   'Curve', 'FK Accuracy', 'Long Passing', 'Ball Control', 'Acceleration', 'Sprint Speed',
                   'Agility', 'Reactions', 'Balance', 'Shot Power', 'Jumping', 'Stamina', 'Strength',
                   'Long Shots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties',
                   'Composure', 'Defensive Awareness', 'Standing Tackle', 'Sliding Tackle', 'GK Diving',
                   'GK Handling', 'GK Kicking', 'GK Positioning', 'GK Reflexes']
    with open('./any path you like/Chinese file names not recommended.csv', 'a', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header_list)
        writer.writerows(dataToWrite)

As for the other write modes, 'w', 'a', and '+': please look them up yourselves (writing tutorials is so easy this way).
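And if you do want xlsx rather than csv, a minimal openpyxl sketch (the file name is a placeholder):

from openpyxl import Workbook

def dealWithDataXlsx(dataToWrite, header_list):
    wb = Workbook()
    ws = wb.active
    ws.append(header_list)   # header row
    for row in dataToWrite:
        ws.append(row)       # one player per row
    wb.save("players.xlsx")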

Searching for ids: how do we call the function above, and where do the needed player ids come from? I use two methods, introduced in turn:

Incremental ids: at first I traversed with an incremental id, the cast-the-net-wide approach. It is quite crude; many of the ids reached this way lead to pages that do not display player data, such as women's player pages.

# The actual code no longer uses this; a sketch of the idea:
soData = []
for s in range(20000, 40000):
    l = fetchData(s)
    if l is not None:
        soData.append(l)
        dealWithData(soData)

Searching one entry and writing one entry like this is very inefficient. You can search in batches, say 100 entries at a time, and then write them out as a whole; when writing the csv, comment out the header_list line so the header isn't written over and over.
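A sketch of that batching idea: accumulate rows in memory and flush every 100 (the batch size is just the figure from the text):

soData = []
for s in range(20000, 40000):
    l = fetchData(s)
    if l is not None:
        soData.append(l)
    if len(soData) >= 100:  # flush a full batch in one write
        dealWithData(soData)
        soData = []
if soData:  # write whatever is left over
    dealWithData(soData)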

Id list: we use a csv file, add the ids we want to look up, and then read the list for targeted searching.

# Read the search list
searchList = []
with open('./pick your own directory/ids to search.CSV', "r", encoding='utf-8-sig') as f:
    f_csv = csv.reader(f, dialect='excel', delimiter=',')
    searchList.extend(iter(f_csv))
# print(len(searchList))

# Search
soData = []
for p in searchList:
    # the id is read back as a string, so slice it out
    soid = str(p)[2:-2]
    l = fetchData(soid)
    if l is not None:
        soData.append(l)
dealWithData(soData)

And that's that. The ids themselves we need to get from the sofifa site. I have searched by name, by OVR, and by club; each is given below.

Search by player name

Searching by name on this site brings up a list of players. For example, searching Valentino shows the following players:

Without further ado, the code:

def getPlayerID(key):
    url = f"https://sofifa.com/players?keyword={str(key)}"
    myRequest = requests.get(url)
    soup = BeautifulSoup(myRequest.text, 'lxml')
    playerTable = soup.select("#body > div.center > div > div.col.col-12 > div > table > tbody")
    # print(len(playerTable[0].contents))
    data = playerTable[0].contents
    playersCandicate = []
    if len(data) > 0:
        for p in data:
            id = p.find("img")["id"]
            name = p.find("a")["aria-label"]
            ovr = p.find(attrs={"data-col": "oa"}).get_text()
            playersCandicate.append([id, name, ovr])
    else:
        print("not found")
        playersCandicate.append(["not found", "the name you're searching is >>", key])
    return playersCandicate

This function gathers all the search results; if there are none, it returns "not found". Bear in mind that many players with similar names will match; which one you actually need, you must filter yourself.

Similarly, put the names you want to search into a csv for convenience.

# Read the list of names to search
searchList = []
with open('toSearchByName.CSV', "r", encoding='utf-8-sig') as f:
    f_csv = csv.reader(f, dialect='excel')
    searchList.extend(iter(f_csv))

# Search; note that every player with the same name will come back
idata = []
for p in searchList:
    keyword = str(p)[2:-2]
    l = getPlayerID(keyword)
    if l is not None:
        idata.extend(l)
dealWithData(idata)

Search by the player's overall rating (OVR): next, searching through OVR.

Click search, and you'll find the URL now looks like this; you can see that oal means overall low and oah means overall high.

The code is as follows:

# Pass in the OVR minimum, maximum, and number of pages (60 players per page)
def searchByOVR(min, max, pages):
    p = 0
    playersCandicate = []
    while p < pages:
        # oal/oah are the parameters seen in the browser URL above; pagination steps by 60
        url = f"https://sofifa.com/players?oal={str(min)}&oah={str(max)}&offset={str(p * 60)}"
        myRequest = requests.get(url)
        soup = BeautifulSoup(myRequest.text, 'lxml')
        playerTable = soup.select("#body > div.center > div > div.col.col-12 > div > table > tbody")
        data = playerTable[0].contents
        if len(data) > 0:
            for row in data:
                id = row.find("img")["id"]
                name = row.find("a")["aria-label"]
                ovr = row.find(attrs={"data-col": "oa"}).get_text()
                playersCandicate.append([id, name, ovr])
        p += 1
    return playersCandicate

When calling it, for example, to search 65 to 75 across 10 pages: searchByOVR(65, 75, 10)

Search by club: you need to know the club's id. Choose Teams and you can see each club's unique id and its starting line-up.

Here is how to get a starting line-up from a club id:

# Get the team line-up
def getLineup(id):
    url = f'https://sofifa.com/teams/{str(id)}'
    myRequest = requests.get(url)
    soup = BeautifulSoup(myRequest.text, 'lxml')
    clubName = soup.find("h1").text
    if clubName == 'Teams':
        return None
    lineup = soup.select("#body > div:nth-child(4) > div > div.col.col-12 > div > div.block-two-third > div > div")
    data = lineup[0].find_all("a")
    field_player = []
    if len(data) > 0:
        for p in data:
            temp = str(p.attrs["href"])
            temp = temp.lstrip("/player/")
            temp = temp.rstrip("/")
            id = temp[:temp.find("/")]
            field_player.append([clubName, id, p.attrs["title"], p.text[:2]])
    return field_player

As for how to get club ids: just as with players, either traverse incremental ids and record them, or search by the team's name; I won't repeat that here.
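For completeness, a sketch of looking up club ids by team name, assuming the Teams page accepts a keyword parameter the way the player search does (both the parameter and the href pattern are assumptions, not verified against the site):

def getTeamID(key):
    url = f"https://sofifa.com/teams?keyword={str(key)}"
    myRequest = requests.get(url)
    soup = BeautifulSoup(myRequest.text, 'lxml')
    teams = []
    for a in soup.select("table tbody a[href^='/team']"):
        parts = a.get("href").strip("/").split("/")  # e.g. ['team', '<id>', '<name>']
        teams.append([parts[1], a.text])
    return teams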

To sum up: as the saying goes, life is short, I use Python. As a scripting language, speed and simplicity are Python's greatest strengths. You can customize this kind of crawler to your own needs, and for more advanced crawling there are frameworks such as Scrapy. Common utility functions, such as writing csv or reading and writing excel, can be collected in a misc.py as needed. Honestly, because new requirements keep coming, this was written casually and many comments never got written. There is no beauty in this kind of brute-force bricklaying; new requirements arrive one after another and there is no time to refactor. Thank God it runs at all. After seeing the "exited with code 0 in ... seconds" message, I never want to open it again. I hope everyone takes this as a warning and writes code that is easy to understand.

This article comes from the WeChat official account: Thousand Monkeys and Horses Game Design (ID: baima21th). Author: Thousand Taels.
