But we all know how the internet works. Whenever some shiny new AI toy is released, there’s always a lot of chatter and excitement. Sometimes it lives up to the hype. Other times it fizzles out and turns into a historical footnote.
Where will DeepSeek end up?
Only time will tell. But if you’re curious about it and want to get an overview of what it’s capable of without testing it yourself, then you’re in the right place. After reading about ten articles on how amazing it is, I decided to take it for a spin. Keep reading to learn more about the experiments I ran and what I discovered about DeepSeek’s strengths and limitations.
Oh, and there was that “censorship feature” as well. 😱
My testing setup and methodology
I spent a few hours of quality time with DeepSeek this past Sunday and ran it through a battery of tests. For some of the more technical ones I asked Claude 3.5 Sonnet to generate a prompt for me and I fed this prompt to both DeepSeek and GPT-o1. In addition, I asked Claude to produce an answer to its own prompt.
I then read the individual responses, and for an even deeper insight, I cross-referenced them by giving each model the answers of the other two. Reading through all the analyses gave me a strong understanding of each model’s capabilities.
As for the tests, I tried to make each one a different type of task. This way we could see how DeepSeek handles knowledge across subjects and task types. These were the tests:
- Solving a complex mathematical problem that requires a high level of reasoning capability.
- Generating Python code for a program that builds a word ladder between two words.
- Identifying common scientific misconceptions and explaining them to a middle schooler.
- Answering questions about political leaders, historical figures, and atrocities.
⚠️ A QUICK DISCLAIMER BEFORE WE GET INTO IT
My testing, while relatively thorough for one person on a Sunday afternoon tinkering with AI, is still exactly that. It’s one person spending a few hours running tests. AI systems are complex and frequently updated, so your experience might differ from what I observed. Nonetheless, I think you’ll find the results interesting – particularly because they don’t fully line up with the current media hype.
Mathematical reasoning
To test DeepSeek’s mathematical reasoning abilities, I asked Claude 3.5 Sonnet to create a challenging optimization puzzle. The prompt needed to be complex enough to require careful analysis, but straightforward enough that we could easily verify the correct answer.
Claude designed a scenario about a toy factory with three production lines and limited workers:
The results
GPT-o1 wrote the most comprehensive solution, methodically explaining multiple valid ways to reach the 1,080-toy maximum. It spotted that Lines A and C produced 60 toys per worker-hour, while Line B lagged at 50 – a crucial insight that DeepSeek missed entirely.
Claude’s solution, while reaching the same correct number, took a more direct route. It identified the most efficient lines and allocated workers accordingly, but it didn’t explore alternative ways to arrive at 1,080 like GPT did.
DeepSeek moved fast but arrived at a suboptimal answer of 900 toys per hour. Where it did excel was presentation: its response came formatted with clean headers and precise mathematical notation. Unfortunately, neat formatting couldn’t compensate for the fundamental flaw in its reasoning.
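If you want to sanity-check the arithmetic, here’s a minimal brute-force sketch of the kind of allocation search involved. To be clear, this is my own illustration rather than any model’s output, and the per-worker rates, line caps, and workforce size below are assumptions I picked so that the optimum lands at the 1,080 figure – Claude’s actual prompt had its own constraints.

```python
from itertools import product

# Hypothetical version of the puzzle (the original prompt isn't reproduced here):
# three production lines with different per-worker output and per-line worker caps.
RATES = {"A": 60, "B": 50, "C": 60}   # toys per worker-hour (assumed)
CAPS = {"A": 10, "B": 8, "C": 8}      # max workers per line (assumed)
TOTAL_WORKERS = 18                    # assumed workforce size

best = (0, None)
for a, b, c in product(range(CAPS["A"] + 1), range(CAPS["B"] + 1), range(CAPS["C"] + 1)):
    if a + b + c != TOTAL_WORKERS:
        continue
    toys = a * RATES["A"] + b * RATES["B"] + c * RATES["C"]
    if toys > best[0]:
        best = (toys, (a, b, c))

print(best)  # (1080, (10, 0, 8)) -- every worker on the 60-toy-per-hour lines
```

With these made-up numbers, the whole trick is keeping workers off the slower line – the same per-worker-hour insight that DeepSeek missed.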
Coding aptitude
For the next test, I once again turned to Claude for assistance in generating a coding challenge. It created a classic computer science problem – building a word ladder. The challenge required finding the shortest chain of words connecting two four-letter words, changing only one letter at a time. For example, turning “COLD” into “WARM” through valid intermediate words.
Since there was more than one way to arrive at a solution, it made for some interesting results. Each of the three tools went about it in its own way, but the winner was once again…
The results
GPT-o1’s code stood out as the most polished of the three. By generating word connections on-demand rather than preprocessing everything, it struck a balance between efficiency and memory usage. The extensive documentation and clean organization made it feel like something you’d find in a professional codebase.
Claude’s solution preprocessed the entire word graph before searching. It prioritized speed for multiple queries at the cost of initial memory usage. The approach was akin to reading an instruction manual before beginning a project, taking more time upfront, but making the process faster once started.
Finally, DeepSeek’s approach, while functional, lacked the sophistication of the other two. It took a more direct path to solving the problem but missed opportunities for optimization and error handling.
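For the curious, here’s roughly what the “generate connections on demand” style of solution looks like in practice. This is a stripped-down breadth-first search of my own for illustration, not GPT-o1’s actual code, and the tiny word list is just a stand-in for a real dictionary.

```python
from collections import deque

def neighbors(word, dictionary):
    """Yield valid words that differ from `word` by exactly one letter."""
    for i in range(len(word)):
        for ch in "abcdefghijklmnopqrstuvwxyz":
            candidate = word[:i] + ch + word[i + 1:]
            if candidate != word and candidate in dictionary:
                yield candidate

def word_ladder(start, end, dictionary):
    """Breadth-first search for the shortest ladder from `start` to `end`."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in neighbors(path[-1], dictionary):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no ladder exists within this dictionary

# Toy dictionary purely for illustration -- a real solver would load a full word list.
words = {"cold", "cord", "card", "ward", "warm", "word"}
print(word_ladder("cold", "warm", words))  # ['cold', 'cord', 'card', 'ward', 'warm']
```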
Scientific understanding
To test DeepSeek’s ability to explain complex concepts clearly, I gave all three AIs eight common scientific misconceptions and asked them to correct them in language a middle school student could understand. The topics ranged from basic physics (why metal feels colder than wood) to astronomy (what causes Earth’s seasons).
Calling a clear winner on this test was challenging because even though there were objective facts at play, there were also subjective nuances. Obviously science is science, so if something is not factually sound, that’s one thing. But if you look at the prompt, I set a target audience here – middle school students. This means that I wasn’t only looking for accuracy, but also delivery. And therein lies the subjective element.
The results
Overall, all three models excelled in their own way; rather than one being clearly better than another, each had its own strengths and weaknesses. There were also some mild disagreements.
GPT-o1 delivered a rapid, well-structured response. Each explanation flowed logically from identifying the error to providing the correct science, using relevant examples like comparing heat energy in a hot cup versus a cool swimming pool.
It mostly maintained middle-school-appropriate language. However, there were a few terms that I’m not sure every middle schooler would understand (e.g., thermal equilibrium, thermal conductor). Then again, it’s been a very long time since I was that age, so what do I know?! 🤷‍♂️
Claude matched GPT-o1’s scientific accuracy but took a more systematic approach. Rather than providing standalone explanations, it built connections between concepts. For example, it illustrated how understanding thermal conductivity helps explain both why metal feels cold and how heat moves through different materials. The only minor drawback I found was the same as with GPT-o1: I wasn’t entirely convinced that all of the explanations were written at a middle school level.
DeepSeek’s responses were scientifically accurate but sometimes repetitive without adding depth. When explaining warm air rising, for instance, it restated the same basic concept three times instead of building toward deeper understanding. On the plus side, it did excel at keeping technical language simple and accessible. I felt that it came the closest to that middle school level that both GPT-o1 and Claude seemed to overshoot.
Nuances in conclusions
You may have noticed that I deliberately chose to use the same response for all three screenshots above. It’s because this particular one had the most “disagreement.” GPT and Claude said similar things but drew opposite conclusions, while DeepSeek didn’t even mention certain elements that the other two did.
GPT said the statement was correct because pressure and salt can make water boil at higher temperatures, which leads to faster cooking. Claude declared it incorrect. Even though it noted that pressure cookers can achieve higher cooking temperatures, it considered pressure as an external factor and not applicable to the original statement.
DeepSeek’s explanation, while not incorrect, missed these practical details entirely. It simply stated that cooking speed depends on water temperature without exploring any of the factors that could actually change that temperature. Nonetheless, I still think that DeepSeek had a strong showing on this test. Certainly stronger than on the previous tests.
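As a bit of context on the claim the models were debating (and not something any of them showed): the salt effect comes down to boiling-point elevation, ΔT = i × Kb × m. Here’s a quick back-of-the-envelope calculation of my own, using assumed amounts of roughly a spoonful of salt in a litre of water:

```python
# Back-of-the-envelope boiling-point elevation for salted cooking water.
# Assumed amounts: ~10 g of table salt in 1 kg (about 1 litre) of water.
KB_WATER = 0.512          # ebullioscopic constant of water, degC * kg / mol
VANT_HOFF_NACL = 2        # NaCl dissociates into two ions
MOLAR_MASS_NACL = 58.44   # g/mol

salt_grams = 10
water_kg = 1.0

molality = (salt_grams / MOLAR_MASS_NACL) / water_kg
delta_t = VANT_HOFF_NACL * KB_WATER * molality
print(f"Boiling point rises by about {delta_t:.2f} degC")  # ~0.18 degC
```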
DeepSeek’s political boundaries
The final test I ran was more of a “just for fun” experiment to see how DeepSeek would react when asked about Chinese history and Chinese leaders versus those outside of China. I didn’t compare these responses with GPT-o1 or Claude. It was more to see how censorship works on the platform.
Bill Clinton versus Chinese leaders
I began by giving DeepSeek a very neutral, informative query: Tell me about Bill Clinton.
And it responded with an equally neutral, very balanced take on the former U.S. president:
All things considered, it was a standard AI response. I followed up with the exact same prompt, but swapped Bill Clinton out for Jiang Zemin – who was president of China during roughly the same period that Clinton served. I immediately got back the following message: “Sorry, that’s beyond my current scope. Let’s talk about something else.”
I wanted to see how many presidents back this censorship would go, so I repeated the same prompt for the president before Jiang Zemin.
And then the one before him, and so on:
I was convinced the pattern would continue ad infinitum. But then all of a sudden, when I asked about Chen Yun, it gave me a complete answer – just like it did for Bill Clinton.
At that point I was surprised but also curious to see what it would do for the next preceding Chinese leader – Deng Xiaoping:
And it was right back to censorship mode. I played around with a few more Chinese figures from recent history, mixing those who had more of an economic impact on China with those who can be considered more political. The results continued to surprise me as I couldn’t find a clear pattern or possible criteria that DeepSeek might be using to decide which people to censor and which to allow.
Historical atrocities
For this test, I wanted to see if DeepSeek handled historical events differently than political figures. I started with an event outside China’s borders – the My Lai massacre committed by American troops during the Vietnam War. DeepSeek provided a thorough, balanced account similar to its Bill Clinton biography.
Next, I asked about the Tiananmen Square massacre. As expected, DeepSeek immediately gave its standard deflection: “Sorry, that’s beyond my current scope. Let’s talk about something else.”
But then a funny thing happened.
I followed up my Tiananmen query by asking about Mao Zedong’s Cultural Revolution. DeepSeek initially produced an entire full-length response about the decade-long sociopolitical movement. Then, a few seconds later, the text vanished, replaced by the same deflection message from before:
This behavior suggests at least two distinct filtering systems:
- An immediate block for certain keywords or phrases.
- A secondary review that catches potentially sensitive content even after it’s been generated.
I continued by trying several variations of historically sensitive topics, alternating between Chinese and non-Chinese events. The pattern held: immediate blocks for some topics, delayed censorship for others, and full responses for non-Chinese historical events.
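Just to make the two-stage idea concrete, here’s a purely speculative toy sketch of how such a pipeline could be wired up. I have no visibility into DeepSeek’s actual moderation stack, so every keyword, function name, and rule below is invented for illustration.

```python
# Purely speculative illustration -- none of this reflects DeepSeek's real
# moderation system; the keywords, topics, and functions are all invented.

DEFLECTION = "Sorry, that's beyond my current scope. Let's talk about something else."
BLOCKED_KEYWORDS = {"tiananmen", "jiang zemin"}   # hypothetical stage-1 list
SENSITIVE_TOPICS = {"cultural revolution"}        # hypothetical stage-2 list

def pre_filter(prompt: str) -> bool:
    """Stage 1: block instantly if the prompt itself contains a flagged keyword."""
    return any(kw in prompt.lower() for kw in BLOCKED_KEYWORDS)

def post_review(response: str) -> bool:
    """Stage 2: re-scan the generated text and retract it if it hits a flagged topic."""
    return any(topic in response.lower() for topic in SENSITIVE_TOPICS)

def answer(prompt: str, generate) -> str:
    if pre_filter(prompt):
        return DEFLECTION            # immediate block, nothing is generated
    response = generate(prompt)      # the model produces a full answer
    if post_review(response):
        return DEFLECTION            # delayed censorship: the answer is replaced after the fact
    return response
```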
Final thoughts
In my humble opinion, DeepSeek is not the GPT killer it was made out to be all last week – at least not yet. Don’t get me wrong, it’s impressive, especially considering the vast funding gap and the limited resources its creators had to work with.
But as you saw for yourself, GPT-o1 dominated the tests I ran, and Claude did better as well. Then again, Claude’s results were also influenced by the fact that it created most of the tests, so it’s not exactly a fair comparison in that sense.
I’m also not naive; I realize that my tiny little experiment isn’t the equivalent of the rigorous official benchmarks these models are put through. On the flip side, it does give you a perspective on the “average user” experience. And that has its own inherent value.
Have you tried DeepSeek yet? What do you think? Is it over-hyped or is it really that good? Let me know in the comments. I’ll see you there.